MainSummary_SampleId
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Main Summary - Sample Id\n",
"\n",
"The `sample_id` column in the main summary allows you to work with a smaller subset of the data. In this report we explore the main summary dataset during the month of December 2016, leading up to how you can efficiently use the sample id in your own reports.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 8 ms, sys: 0 ns, total: 8 ms\n",
"Wall time: 36.7 s\n"
]
}
],
"source": [
"bucket = \"telemetry-parquet\"\n",
"prefix = \"main_summary/v3\"\n",
"%time dataset = sqlContext.read.load(\"s3://{}/{}\".format(bucket, prefix), \"parquet\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The main_summary is partitioned in s3 by submission date to s3 and by a sample_id tag that is given to each ping. Filtering along these attributes is efficient, since spark doesn't have to load the whole dataset into memory. We can use the spark flavored sql operations (sparksql) to find all the data related to December 2016. \n",
"\n",
"You'll find that there are many different ways to do the same thing. Below are examples of using the `like` operator and a simple string comparison. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Performance of using col.like\n",
"CPU times: user 8 ms, sys: 4 ms, total: 12 ms\n",
"Wall time: 2min 18s\n",
"Performance of using string comparisons\n",
"CPU times: user 8 ms, sys: 0 ns, total: 8 ms\n",
"Wall time: 1min 29s\n"
]
}
],
"source": [
"from pyspark.sql.functions import col\n",
"like_function_df = dataset.where(col(\"submission_date_s3\").like(\"201612%\"))\n",
"string_comp_df = (\n",
" dataset\n",
" .where(col(\"submission_date_s3\") >= \"20161201\")\n",
" .where(col(\"submission_date_s3\") < \"20170101\")\n",
")\n",
"\n",
"print(\"Performance of using col.like\")\n",
"%time like_function_df.count()\n",
"\n",
"print(\"Performance of using string comparisons\")\n",
"%time month_count = string_comp_df.count()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# both of the above perform roughly the same\n",
"month_df = string_comp_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The main summary has a sample_id attribute, where each id is a 1% sample of the whole dataset. The ids are labels in the range [1, 100]. Here, we show that the distribution of sample_ids is uniform. The extra column are pings where `sample_id=null`. "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Filtering based on day\n",
"CPU times: user 0 ns, sys: 0 ns, total: 0 ns\n",
"Wall time: 18.4 ms\n",
"Result: 425907777\n",
"+-------+-----------------+\n",
"|summary| count|\n",
"+-------+-----------------+\n",
"| count| 101|\n",
"| mean|4216908.683168317|\n",
"| stddev| 425505.970092687|\n",
"| min| 8312|\n",
"| max| 4501883|\n",
"+-------+-----------------+\n",
"\n"
]
}
],
"source": [
"print(\"Filtering based on day\")\n",
"%time day_df = month_df.where(col(\"submission_date_s3\") == \"20161201\")\n",
"print(\"Result: \" + str(day_df.count()))\n",
"\n",
"day_df.groupBy(\"sample_id\").count().describe().show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Armed with this knowledge, we can sample a certain percentage of the data. We sample 1, 5, 10, and 25% of the data using a random set of ids."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 4 ms, sys: 0 ns, total: 4 ms\n",
"Wall time: 11.2 s\n",
"Sampled 1% - Expected 111478398, Result 112101487.93, % Error 0.555826636654 \n",
"\n",
"CPU times: user 8 ms, sys: 0 ns, total: 8 ms\n",
"Wall time: 8.97 s\n",
"Sampled 5% - Expected 561731626, Result 560507439.65, % Error 0.218406797734 \n",
"\n",
"CPU times: user 8 ms, sys: 4 ms, total: 12 ms\n",
"Wall time: 15.2 s\n",
"Sampled 10% - Expected 1119618584, Result 1121014879.3, % Error 0.124556357439 \n",
"\n",
"CPU times: user 12 ms, sys: 0 ns, total: 12 ms\n",
"Wall time: 31.6 s\n",
"Sampled 25% - Expected 2808411637, Result 2802537198.25, % Error 0.209611446145 \n",
"\n"
]
}
],
"source": [
"import random\n",
"\n",
"# use the variables month_df, month_count from previous cells\n",
"\n",
"def percent_error(expected, theoretical):\n",
" return abs(expected-theoretical)/float(theoretical) * 100.0\n",
"\n",
"percentages = [1, 5, 10, 25]\n",
"labels = set(range(1, 101))\n",
"\n",
"for percentage in percentages:\n",
" # percentage is an integer in [1, 100] corresponding to total percentage to sample\n",
" sampled = random.sample(labels, percentage)\n",
" \n",
" # equivalent to `select count(sample_id) from month_df where month_df.sample_id in (1, 2, 3 ...)`\n",
" %time result = month_df.where(col(\"sample_id\").isin(sampled)).count()\n",
" \n",
" expected = month_count * (percentage / 100.0)\n",
" percent_err = percent_error(result, expected)\n",
" print(\"Sampled {}% - Expected {}, Result {}, % Error {} \\n\"\n",
" .format(percentage, result, expected, percent_err))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The set of ids are chosen randomly from all possible sample_ids. We take advantage of the `isin` operator ([sparksql](https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.Column.isin)) to count all the rows where the sample_id is within the set of ids. Note that our percentage error from the total sample is low, which allows us to use this subset as a good approximation of the full set. "
]
}
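,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a practical takeaway, a fixed set of sample ids gives a reproducible subset across reports. The cell below is a minimal sketch (not part of the original analysis): it reuses `col` and `month_df` from the cells above and keeps a single, arbitrarily chosen id for roughly a 1% sample, or a fixed list of ids for a larger sample.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Minimal sketch: reproducible samples of the December data.\n",
"# sample_id 42 is an arbitrary choice; any fixed id (or list of ids) works the same way.\n",
"one_percent_df = month_df.where(col(\"sample_id\") == 42)\n",
"\n",
"# A fixed 10% sample, comparable to the random sampling above but stable across runs.\n",
"ten_percent_df = month_df.where(col(\"sample_id\").isin(list(range(1, 11))))"
]
}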
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [conda root]",
"language": "python",
"name": "conda-root-py"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
# coding: utf-8
# # Main Summary - Sample Id
#
# The `sample_id` column in the main summary allows you to work with a smaller subset of the data. In this report we explore the main summary dataset for December 2016 and show how you can use the sample id efficiently in your own reports.
#
# In[1]:
bucket = "telemetry-parquet"
prefix = "main_summary/v3"
get_ipython().magic(u'time dataset = sqlContext.read.load("s3://{}/{}".format(bucket, prefix), "parquet")')
# The main_summary dataset is partitioned in S3 by submission date (`submission_date_s3`) and by a `sample_id` tag assigned to each ping. Filtering on these attributes is efficient, since Spark doesn't have to load the whole dataset into memory. We can use Spark SQL column operations to find all the data from December 2016.
#
# You'll find there are many ways to express the same filter. Below are examples using the `like` operator and a simple string comparison.
# In[22]:
from pyspark.sql.functions import col
like_function_df = dataset.where(col("submission_date_s3").like("201612%"))
string_comp_df = (
dataset
.where(col("submission_date_s3") >= "20161201")
.where(col("submission_date_s3") < "20170101")
)
print("Performance of using col.like")
get_ipython().magic(u'time like_function_df.count()')
print("Performance of using string comparisons")
get_ipython().magic(u'time month_count = string_comp_df.count()')
# In[5]:
# both filters select the same rows; continue with the string comparison version
month_df = string_comp_df
# The main summary has a `sample_id` attribute, where each id identifies roughly a 1% sample of the whole dataset. The ids are labels in the range [1, 100]. Here we show that the distribution of sample ids is roughly uniform; the extra group (101 rather than 100) contains pings where `sample_id` is null.
# In[14]:
print("Filtering based on day")
get_ipython().magic(u'time day_df = month_df.where(col("submission_date_s3") == "20161201")')
print("Result: " + str(day_df.count()))
day_df.groupBy("sample_id").count().describe().show()
# Armed with this knowledge, we can sample an arbitrary percentage of the data by selecting that many sample ids. Below we sample 1, 5, 10, and 25% of the data using randomly chosen sets of ids.
# In[24]:
import random
# use the variables month_df, month_count from previous cells
def percent_error(observed, theoretical):
    return abs(observed - theoretical) / float(theoretical) * 100.0
percentages = [1, 5, 10, 25]
labels = set(range(1, 101))
for percentage in percentages:
    # percentage is an integer in [1, 100]: the percent of the data to sample
    sampled = random.sample(labels, percentage)
    # equivalent to `select count(sample_id) from month_df where month_df.sample_id in (1, 2, 3 ...)`
    get_ipython().magic(u'time result = month_df.where(col("sample_id").isin(sampled)).count()')
    expected = month_count * (percentage / 100.0)
    percent_err = percent_error(result, expected)
    print("Sampled {}% - Result {}, Expected {}, % Error {} \n"
          .format(percentage, result, expected, percent_err))
# The sets of ids are chosen randomly from all possible sample ids. We take advantage of the `isin` operator ([Spark SQL docs](https://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.Column.isin)) to count the rows whose `sample_id` falls within the chosen set. The percentage error relative to the full month of data is low, so these subsets are good approximations of the full dataset.
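
# As a practical takeaway, a fixed set of sample ids gives a reproducible subset across
# reports. The snippet below is a minimal sketch (not part of the original analysis): it
# reuses `col` and `month_df` from above and keeps a single, arbitrarily chosen id for
# roughly a 1% sample, or a fixed list of ids for a larger sample.

# Minimal sketch: reproducible samples of the December data.
# sample_id 42 is an arbitrary choice; any fixed id (or list of ids) works the same way.
one_percent_df = month_df.where(col("sample_id") == 42)

# A fixed 10% sample, comparable to the random sampling above but stable across runs.
ten_percent_df = month_df.where(col("sample_id").isin(list(range(1, 11))))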