# coding: utf-8
# # Churn - Adding additional fields
#
# This report explores the differences between two versions of the churn dataset: the unmodified dataset and one with additional fields for bugs [1323598](https://bugzilla.mozilla.org/show_bug.cgi?id=1323598) and [1337037](https://bugzilla.mozilla.org/show_bug.cgi?id=1337037). The two versions are functionally equivalent when handled properly.
#
# Adding new attributes/columns to this dataset requires recalculating aggregated values. In the case of churn, the following fields are aggregated from a set of similar clients.
#
# ```
# |-- n_profiles: long (nullable = true)
# |-- usage_hours: double (nullable = true)
# |-- sum_squared_usage_hours: double (nullable = true)
# ```
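#
# Keeping the sum of squares alongside the plain sum is what allows the variance of usage hours to be recovered after further aggregation. A sketch of the standard identity, assuming the three fields above have already been summed over a group:
#
# ```python
# mean = usage_hours / n_profiles
# variance = sum_squared_usage_hours / n_profiles - mean ** 2
# ```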
#
# All analyses and reports that use this dataset should explicitly aggregate these fields by grouping on the relevant dimension columns.
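#
# A minimal sketch of that pattern (the grouping columns here are illustrative, not the full set of dimensions):
#
# ```python
# (
#     churn
#     .groupBy("channel", "geo")
#     .agg(F.sum("n_profiles").alias("n_profiles"),
#          F.sum("usage_hours").alias("usage_hours"))
# )
# ```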
#
# ## Setup
#
# The test data is generated with a 1% sample of main_summary corresponding to `sample_id == 0` for the 1 week churn period starting on `20170205`.
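#
# A minimal sketch of the sampling filter, assuming the standard `main_summary` layout (the dataset path is illustrative):
#
# ```python
# import pyspark.sql.functions as F
#
# main_summary = spark.read.parquet("s3://telemetry-parquet/main_summary/v3")
# sample = main_summary.where(F.col("sample_id") == 0)
# ```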
#
# The control dataset is generated from the churn notebook corresponding to [a pinned version of mozilla-reports](https://github.com/mozilla/mozilla-reports/commit/2de3ef16e95ea1b2977d42a7e7a5b9a0aec94f59). The new dataframe is generated through a patched version of the above [rehosted via this gist](https://gist.github.com/acmiyaguchi/cd17a0211ea8c026b44f8da73e84c190). A pull-request will be available soon.
# In[1]:
import pyspark.sql.functions as F
# hosted location of generated datasets
bucket = "net-mozaws-prod-us-west-2-pipeline-analysis"
prefix = "amiyaguchi/test_churn"
old_df = spark.read.parquet("s3://{}/{}/original".format(bucket, prefix))
new_df = spark.read.parquet("s3://{}/{}/updated".format(bucket, prefix))
# ## Observations
#
# The updated churn dataset adds a total of seven fields to the schema, with four fields for stub attribution and three fields for search retention.
# In[2]:
print("Schema for original churn dataset")
old_df.printSchema()
print("Schema for updated churn dataset")
new_df.printSchema()
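# The added fields can also be enumerated directly as the set difference of the two schemas (a quick sanity check, not part of the original run):
#
# ```python
# sorted(set(new_df.columns) - set(old_df.columns))
# ```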
# The updated dataset contains almost twice as many rows as the original. This corresponds to the finer granularity of aggregates over the seven extra fields. However, note that the row count is exactly the same as the original, 314106, once we aggregate over the original set of columns.
# In[3]:
print("Original row count: {}".format(old_df.count()))
print("New row count: {}".format(new_df.count()))
# In[4]:
# grouping columns: the original schema minus its trailing aggregate columns
original_columns = old_df.columns[:-4]
(
    new_df
    .groupby(original_columns)
    .agg(F.sum("n_profiles").alias("n_profiles"))
).count()
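# The same comparison can be made explicit with an assertion (a sketch; both counts should be 314106):
#
# ```python
# grouped = new_df.groupby(original_columns).agg(F.sum("n_profiles"))
# assert grouped.count() == old_df.count()
# ```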
# Additionally, we verify that the totals for both datasets are the same.
# In[5]:
# Show that both dataframes have the same total aggregates
def total_aggregates(df):
    agg_cols = ["n_profiles", "usage_hours", "sum_squared_usage_hours"]
    df.select([F.sum(x) for x in agg_cols]).show()
print("Total aggregates for original dataframe")
total_aggregates(old_df)
print("Total aggregates for updated dataframe")
total_aggregates(new_df)
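# A stricter check compares the collected totals within floating-point tolerance instead of eyeballing the printed tables (a sketch, not part of the original run):
#
# ```python
# import math
#
# def totals(df):
#     row = df.select(
#         F.sum("n_profiles").alias("n_profiles"),
#         F.sum("usage_hours").alias("usage_hours"),
#         F.sum("sum_squared_usage_hours").alias("sum_squared_usage_hours"),
#     ).first()
#     return row.asDict()
#
# old, new = totals(old_df), totals(new_df)
# assert old["n_profiles"] == new["n_profiles"]
# assert math.isclose(old["usage_hours"], new["usage_hours"], rel_tol=1e-6)
# assert math.isclose(old["sum_squared_usage_hours"],
#                     new["sum_squared_usage_hours"], rel_tol=1e-6)
# ```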
# ## Discussion
#
# These observations should suffice to show that the two datasets are equivalent, aside from their level of granularity.
#
# This notebook is not a replacement for unit tests, but it should explain the differences in how the churn dataset is generated.