-
-
Save mreid-moz/518f7515aac54cd246635c333683ecce to your computer and use it in GitHub Desktop.
MainSummaryExample
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# coding: utf-8 | |
# # MainSummary Dataset example | |
# | |
# Documentation can be found [here](https://github.com/mozilla/telemetry-batch-view/blob/master/docs/MainSummary.md). Dataset generation code is [here](https://github.com/mozilla/telemetry-batch-view/blob/master/src/main/scala/com/mozilla/telemetry/views/MainSummaryView.scala). | |
# | |
# Details and requirements are in [Bug 1260847](https://bugzilla.mozilla.org/show_bug.cgi?id=1260847). See also [Bug 1254716](https://bugzilla.mozilla.org/show_bug.cgi?id=1254716). | |
# In[1]: | |
# How many cores are we running on? | |
sc.defaultParallelism | |
# ### Read source data | |
# | |
# Read the data from the parquet datastore on S3. | |
# In[2]: | |
from pyspark.sql import SQLContext | |
from pyspark.sql.types import * | |
bucket = "telemetry-parquet" | |
prefix = "main_summary/v3" | |
get_ipython().magic(u'time dataset = sqlContext.read.load("s3://{}/{}".format(bucket, prefix), "parquet")') | |
# In[3]: | |
dataset.printSchema() | |
# # Example query: | |
# #### How many unique clientIds took part in each Telemetry Experiment, reporting data between July 4 and July 7? | |
# | |
# First, filter for the target time range | |
# In[4]: | |
get_ipython().magic(u"time dataset = dataset.filter(dataset.submission_date_s3 >= '20160704').filter(dataset.submission_date_s3 <= '20160706')") | |
# Then filter for non-null `active_experiment_id` | |
# In[5]: | |
get_ipython().magic(u'time experiments = dataset.filter(dataset.active_experiment_id.isNotNull()).select("active_experiment_id", "client_id")') | |
# Now group by experiment and count the unique `client_id`s | |
# In[6]: | |
from pyspark.sql.functions import countDistinct | |
get_ipython().magic(u'time grouped = experiments.groupby("active_experiment_id").agg(countDistinct(experiments.client_id)).collect()') | |
# In[7]: | |
grouped | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment