# coding: utf-8
# # Churn - Adding additional fields
#
# This report explores the differences between two versions of the churn dataset: the unmodified dataset and one with additional fields for bugs [1323598](https://bugzilla.mozilla.org/show_bug.cgi?id=1323598) and [1337037](https://bugzilla.mozilla.org/show_bug.cgi?id=1337037). The two versions are functionally equivalent when handled properly.
#
# Adding new attributes/columns to this dataset requires recalculating aggregated values. In the case of churn, the following fields are aggregated from a set of similar clients.
#
# ```
# |-- n_profiles: long (nullable = true)
# |-- usage_hours: double (nullable = true)
# |-- sum_squared_usage_hours: double (nullable = true)
# ```
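#
# Keeping the sum of squares alongside the plain sum is what allows the variance of usage hours to be recovered after further aggregation. A sketch of the standard identity, assuming the three fields above have already been summed over a group:
#
# ```python
# mean = usage_hours / n_profiles
# variance = sum_squared_usage_hours / n_profiles - mean ** 2
# ```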
#
# All analyses and reports that use this dataset should explicitly aggregate these fields by grouping on the relevant dimension columns.
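#
# A minimal sketch of that pattern (the grouping columns here are illustrative, not the full set of dimensions):
#
# ```python
# (
#     churn
#     .groupBy("channel", "geo")
#     .agg(F.sum("n_profiles").alias("n_profiles"),
#          F.sum("usage_hours").alias("usage_hours"))
# )
# ```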
#
# ## Setup
#
# The test data is generated with a 1% sample of main_summary corresponding to `sample_id == 0` for the 1 week churn period starting on `20170205`.
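#
# A minimal sketch of the sampling filter, assuming the standard `main_summary` layout (the dataset path is illustrative):
#
# ```python
# import pyspark.sql.functions as F
#
# main_summary = spark.read.parquet("s3://telemetry-parquet/main_summary/v3")
# sample = main_summary.where(F.col("sample_id") == 0)
# ```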
#
# The control dataset is generated from the churn notebook corresponding to [a pinned version of mozilla-reports](https://github.com/mozilla/mozilla-reports/commit/2de3ef16e95ea1b2977d42a7e7a5b9a0aec94f59). The new dataframe is generated through a patched version of the above [rehosted via this gist](https://gist.github.com/acmiyaguchi/cd17a0211ea8c026b44f8da73e84c190). A pull-request will be available soon.
# In[1]:
import pyspark.sql.functions as F
# hosted location of generated datasets
bucket = "net-mozaws-prod-us-west-2-pipeline-analysis"
prefix = "amiyaguchi/test_churn"
old_df = spark.read.parquet("s3://{}/{}/original".format(bucket, prefix))
new_df = spark.read.parquet("s3://{}/{}/updated".format(bucket, prefix))
# ## Observations
#
# The updated churn dataset adds a total of seven fields to the schema, with four fields for stub attribution and three fields for search retention.
# In[2]:
print("Schema for original churn dataset")
old_df.printSchema()
print("Schema for updated churn dataset")
new_df.printSchema()
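# The added fields can also be enumerated directly as the set difference of the two schemas (a quick sanity check, not part of the original run):
#
# ```python
# sorted(set(new_df.columns) - set(old_df.columns))
# ```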
# The updated dataset contains almost twice as many rows as the original. This corresponds to the finer granularity of aggregates over the seven extra fields. However, note that the row count is exactly the same as the original, 314106, once we aggregate over the original set of columns.
# In[3]:
print("Original row count: {}".format(old_df.count()))
print("New row count: {}".format(new_df.count()))
# In[4]:
# grouping columns: the original schema minus its trailing aggregate columns
original_columns = old_df.columns[:-4]
(
    new_df
    .groupby(original_columns)
    .agg(F.sum("n_profiles").alias("n_profiles"))
).count()
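# The same comparison can be made explicit with an assertion (a sketch; both counts should be 314106):
#
# ```python
# grouped = new_df.groupby(original_columns).agg(F.sum("n_profiles"))
# assert grouped.count() == old_df.count()
# ```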
# Additionally, we verify that the totals for both datasets are the same.
# In[5]:
# Show that both dataframes have the same total aggregates
def total_aggregates(df):
    agg_cols = ["n_profiles", "usage_hours", "sum_squared_usage_hours"]
    df.select([F.sum(x) for x in agg_cols]).show()
print("Total aggregates for original dataframe")
total_aggregates(old_df)
print("Total aggregates for updated dataframe")
total_aggregates(new_df)
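# A stricter check compares the collected totals within floating-point tolerance instead of eyeballing the printed tables (a sketch, not part of the original run):
#
# ```python
# import math
#
# def totals(df):
#     row = df.select(
#         F.sum("n_profiles").alias("n_profiles"),
#         F.sum("usage_hours").alias("usage_hours"),
#         F.sum("sum_squared_usage_hours").alias("sum_squared_usage_hours"),
#     ).first()
#     return row.asDict()
#
# old, new = totals(old_df), totals(new_df)
# assert old["n_profiles"] == new["n_profiles"]
# assert math.isclose(old["usage_hours"], new["usage_hours"], rel_tol=1e-6)
# assert math.isclose(old["sum_squared_usage_hours"],
#                     new["sum_squared_usage_hours"], rel_tol=1e-6)
# ```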
# ## Discussion
#
# These observations should suffice to show that the two datasets are equivalent, aside from their level of granularity.
#
# This notebook is not a replacement for unit tests, but it should explain the differences in how the churn dataset is generated.