Sam Penrose SamPenrose

@SamPenrose
SamPenrose / gist:1a8133d2ef6d251addfc
Last active August 29, 2015 14:22
Mozilla FHRv4 subsessionId duplicates
{"nbformat_minor": 0, "cells": [{"execution_count": 1, "cell_type": "code", "source": "import ujson as json\nimport matplotlib.pyplot as plt\nimport pandas as pd\nimport numpy as np\nimport plotly.plotly as py\nimport networkx as nx\nimport collections\n\n\nfrom moztelemetry import get_pings, get_pings_properties, get_one_ping_per_client\n\n%pylab inline", "outputs": [{"output_type": "stream", "name": "stdout", "text": "Populating the interactive namespace from numpy and matplotlib\n"}], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 3, "cell_type": "code", "source": "pings = get_pings(sc, app=\"Firefox\",\n channel=\"nightly\",\n submission_date=(\"20150507\",\"20150514\"),\n fraction=1,\n schema=\"v4\")", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 38, "cell_type": "code", "source": "def extract_sub(p):\n return p.get('payload', {}).get('info', {}).get('subsessionId', 'NO
{"nbformat_minor": 0, "cells": [{"source": "# Session Signature matching", "cell_type": "markdown", "metadata": {}}, {"execution_count": 81, "cell_type": "code", "source": "import ujson as json\nfrom operator import add\n# %pylab inline", "outputs": [], "metadata": {"collapsed": false, "trusted": true}}, {"execution_count": 82, "cell_type": "code", "source": "outBucketName = \"net-mozaws-prod-us-west-2-pipeline-analysis\"\npathToOutput = \"/bcolloran/mergedDataPerClient/nightly/2015-06-15/10009clients/\"", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 83, "cell_type": "code", "source": "# for a tiny sample, you can load one part: \"part-00000\"\n# or you can do more--\n# ten parts: part-0000*\n# or 10% of parts: part-*0\n# or all parts: part-*\npath_to_all = \"s3n://\"+outBucketName+pathToOutput+\"part-*\"\nf = sc.sequenceFile(path_to_all)\nload_all = f.mapValues(json.loads)", "outputs": [], "metadata": {"collapsed": true, "trusted": true}}, {"execution_count": 84, "ce
@SamPenrose
SamPenrose / v2_v4_aurora_search_counts.ipynb
Created July 6, 2015 03:20
v2 vs v4 search counts in aurora
from collections import defaultdict

def get_overlap(pair, v2_extractor=None, v4_extractor=None):
    v2_blobs = pair['v2'].get('data', {}).get('days', {})  # {'YYYY-MM-DD': dict}
    v4_blobs = pair['v4']  # [{'creationDate': 'YYYY-MM-DD:...', 'k': val, ...}, ...]
    # One blob per date in v2, multiple per date in v4
    results = {'v2': {}, 'v4': defaultdict(list)}
    if not (v2_blobs and v4_blobs):
        return results
    v2_dates = v2_blobs.keys()
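
The fragment above is cut off after the v2 dates are collected. As a rough sketch of how the date overlap between the two formats might be computed, assuming only the shapes described in its comments (a v2 dict keyed by 'YYYY-MM-DD' and a list of v4 dicts whose 'creationDate' starts with that same date prefix): the helper names `group_v4_by_date` and `shared_dates` are hypothetical and do not come from the original gist.

```python
from collections import defaultdict


def group_v4_by_date(v4_blobs):
    """Group v4 blobs by the first 10 characters ('YYYY-MM-DD') of creationDate.

    Hypothetical helper: assumes each blob is a dict and that creationDate,
    when present, begins with a 'YYYY-MM-DD' date.
    """
    by_date = defaultdict(list)
    for blob in v4_blobs:
        day = blob.get('creationDate', '')[:10]
        if day:  # skip blobs with no usable creationDate
            by_date[day].append(blob)
    return by_date


def shared_dates(v2_blobs, v4_blobs):
    """Return the sorted dates that appear in both the v2 dict and the v4 list."""
    return sorted(set(v2_blobs) & set(group_v4_by_date(v4_blobs)))
```

Because v2 holds one blob per date and v4 may hold several, grouping the v4 list first keeps the comparison to a simple set intersection on date strings.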
@SamPenrose
SamPenrose / search_pairs.ipynb
Last active August 29, 2015 14:25
Compare search counts in paired v2 and v4 FHR pings.
@SamPenrose
SamPenrose / extract_v4_pings_from_gzip.ipynb
Created July 17, 2015 22:25
Extracting v4 pings from a gzipped file on Spark
@SamPenrose
SamPenrose / pairs_4days_quarter_keys.ipynb
Created July 28, 2015 20:14
Struggling to join RDDs