# coding: utf-8
---
title: "new-profile ping validation on Nightly"
authors:
- dexter
tags:
- tutorial
- examples
- telemetry
- spark
created_at: 2017-06-07
tldr: This notebook verifies that the 'new-profile' ping behaves as expected on the Nightly channel.
---
# # Validate 'new-profile' submissions on Nightly
# This analysis validates the 'new-profile' pings submitted by Nightly builds for one week since the 'new-profile' ping was enabled [in bug 1364068](https://bugzilla.mozilla.org/show_bug.cgi?id=1364068). We are going to verify that:
#
# - the 'new-profile' ping is received within a reasonable time after the profile creation;
# - we receive one ping per client;
# - we don't receive many duplicates overall.
# In[73]:
import ujson as json
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import plotly.plotly as py
import IPython
from plotly.graph_objs import *
from moztelemetry import get_pings_properties, get_one_ping_per_client
from moztelemetry.dataset import Dataset
from datetime import datetime, timedelta
get_ipython().magic(u'matplotlib inline')
IPython.core.pylabtools.figsize(16, 7)
# We'll be looking at the pings received from the 31st of May 2017 to the 7th of June 2017.
# In[74]:
# Note that the 'new-profile' ping needs to use underscores in the Dataset API due to a bug.
pings = Dataset.from_source("telemetry") \
    .where(docType='new_profile') \
    .where(appUpdateChannel="nightly") \
    .where(submissionDate=lambda x: "20170531" <= x < "20170607") \
    .where(appBuildId=lambda x: "20170531" <= x < "20170607") \
    .records(sc, sample=1.0)
# In[75]:
ping_count = pings.count()
ping_count
# ### How many pings were sent in-session and how many at shutdown?
# The `new-profile` ping can be sent either during the browsing session, 30 minutes after the browser starts, or at shutdown ([docs](https://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/new-profile-ping.html#payload-reason)). Let's see how many pings we get in each case.
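# As a minimal illustration of the tally we perform below with Spark's `countByValue()`, here is the same counting as plain Python on made-up sample pings (the data and names here are hypothetical, for illustration only):

```python
from collections import Counter

# Hypothetical sample pings; the real data comes from the Dataset API above.
sample_pings = [
    {"payload/reason": "shutdown"},
    {"payload/reason": "shutdown"},
    {"payload/reason": "startup"},
]

# countByValue() on an RDD of reasons is equivalent to a local Counter.
sample_reason_counts = Counter(p.get("payload/reason") for p in sample_pings)
```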
# In[76]:
raw_subset = get_pings_properties(pings, ["id",
                                          "meta/creationTimestamp",
                                          "meta/Date",
                                          "meta/Timestamp",
                                          "meta/X-PingSender-Version",
                                          "clientId",
                                          "environment/profile/creationDate",
                                          "payload/reason"])
# Discard and count any ping that's missing creationTimestamp or Timestamp.
# In[77]:
def pct(a, b):
    return 100.0 * a / b

subset = raw_subset.filter(lambda p: p["meta/creationTimestamp"] is not None and
                                     p["meta/Timestamp"] is not None)
print("'new-profile' pings with missing timestamps:\t{:.2f}%".format(pct(ping_count - subset.count(), ping_count)))
# In[78]:
reason_counts = subset.map(lambda p: p.get("payload/reason")).countByValue()
for reason, count in reason_counts.iteritems():
    print("'new-profile' pings with reason '{}':\t{:.2f}%".format(reason, pct(count, ping_count)))
# This means that, among all the `new-profile` pings, the majority was sent at shutdown. This could mean different things:
#
# - the browsing session lasted less than 30 minutes;
# - we're receiving duplicate pings at shutdown.
#
# ### Let's check how many duplicates we've seen
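# The dedupe step used below keeps, for each key, the ping with the earliest receive timestamp. As a plain-Python sketch of that strategy (`dedupe_local` and the sample pings are hypothetical, for illustration only):

```python
def dedupe_local(pings, duping_key):
    # Keep, for each key, the ping with the smallest 'meta/Timestamp'.
    best = {}
    for p in pings:
        k = p[duping_key]
        if k not in best or p["meta/Timestamp"] < best[k]["meta/Timestamp"]:
            best[k] = p
    return list(best.values())

sample_dupes = [
    {"id": "a", "meta/Timestamp": 2},
    {"id": "a", "meta/Timestamp": 1},  # duplicate document id, received earlier
    {"id": "b", "meta/Timestamp": 3},
]
deduped_sample = dedupe_local(sample_dupes, "id")
```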
# In[79]:
def dedupe(pings, duping_key):
    return pings.map(lambda p: (p[duping_key], p)) \
                .reduceByKey(lambda a, b: a if a["meta/Timestamp"] < b["meta/Timestamp"] else b) \
                .map(lambda pair: pair[1])
deduped_docid = dedupe(subset, "id")
deduped_docid_count = deduped_docid.count()
total_duplicates = ping_count - deduped_docid_count
print("Duplicate pings percentage (by document id): {:.2f}%".format(pct(total_duplicates, ping_count)))
# A 0.21% duplicate rate is nice compared to the ~1% we usually see for the `main` and `crash` pings. However, nowadays we run de-duplication by document id at pipeline ingestion, so the rate before de-duplication might be a bit higher.
#
# Did we send different pings for the same client id? We shouldn't, as we send at most one 'new-profile' ping per client.
# In[80]:
deduped_clientid = dedupe(deduped_docid, "clientId")
total_duplicates_clientid = deduped_docid_count - deduped_clientid.count()
print("Duplicate pings percentage (by client id): {:.2f}%".format(pct(total_duplicates_clientid, deduped_docid_count)))
# That's disappointing: it looks like we're receiving multiple `new-profile` pings for some clients. Let's dig into this by analysing the set of pings deduped by document id. To have a clearer picture of the problem, let's make sure to aggregate the duplicates ordered by the time they were created on the client.
# In[81]:
# Builds an RDD of (<client id>, [<ordered reason>, <ordered reason>, ...])
clients_with_dupes = deduped_docid.map(lambda p: (p["clientId"], [(p["payload/reason"], p["meta/creationTimestamp"])])) \
                                  .reduceByKey(lambda a, b: sorted(a + b, key=lambda k: k[1])) \
                                  .filter(lambda p: len(p[1]) > 1) \
                                  .map(lambda p: (p[0], [r[0] for r in p[1]]))
# Check how often each case occurs.
sorted(clients_with_dupes.map(lambda p: tuple(p[1])).countByValue().items(), key=lambda k: k[1], reverse=True)
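# The per-client aggregation above can be sketched in plain Python: group pings by client id, order each client's reasons by creation timestamp, and keep only clients with more than one ping (the sample tuples below are hypothetical):

```python
from collections import defaultdict

# (client id, (reason, creation timestamp)) pairs, made up for illustration.
sample_client_pings = [
    ("c1", ("shutdown", 20)),
    ("c1", ("startup", 10)),
    ("c2", ("shutdown", 5)),
]

grouped = defaultdict(list)
for client_id, reason_ts in sample_client_pings:
    grouped[client_id].append(reason_ts)

# Clients with duplicates, reasons ordered by client-side creation time.
sample_dupe_reasons = {c: [r for r, _ in sorted(v, key=lambda k: k[1])]
                       for c, v in grouped.items() if len(v) > 1}
```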
# In[82]:
collected_clientid_reasons = clients_with_dupes.collect()
# In[83]:
num_shutdown_dupes = sum([len(filter(lambda e: e == 'shutdown', t[1])) for t in collected_clientid_reasons])
print("Duplicate 'shutdown' pings percentage (by client id): {:.2f}%"
      .format(pct(num_shutdown_dupes, ping_count)))
# The multiple pings we're receiving for the same client id are mostly `new-profile` pings with reason `shutdown`. This is not too surprising, as most of the `new-profile` pings are being sent at shutdown (75%).
#
# But does the pingsender have anything to do with this? Let's attack the problem like this:
#
# - get a list of the "misbehaving" clients;
# - take a peek at their pings (redacting client and document ids);
# - figure out the next steps.
# In[84]:
sorted_clientid_dupes = sorted(collected_clientid_reasons, key=lambda k: len(k[1]), reverse=True)
misbehaving_clients = [client_id for client_id, _ in sorted_clientid_dupes]
to_redact = subset.filter(lambda p: p.get("clientId") == misbehaving_clients[1]).collect()
# Leave out the client ids and doc ids.
for i, _ in enumerate(to_redact):
    to_redact[i]['clientId'] = 'XXX'
    to_redact[i]['id'] = 'XXX'
to_redact
# Interesting! Looks like many of these pings are missing the pingsender header from the request, meaning:
#
# - that they are not being sent from the pingsender, even though they are generated at shutdown;
# - that we generated a new-profile ping at shutdown, but failed to mark it as 'generated', and so we received more than one copy, each with a different document id.
#
# This leads to other questions:
#
# - How often do we send new-profile pings at shutdown, fail, and then send them again without the pingsender?
# - Does that correlate with the duplicates?
# In[85]:
newprofile_shutdown_from_bad_clients = subset.filter(lambda p: p.get("payload/reason") == 'shutdown') \
                                             .filter(lambda p: p.get("clientId") in misbehaving_clients)
newprofile_shutdown_from_good_clients = subset.filter(lambda p: p.get("payload/reason") == 'shutdown') \
                                              .filter(lambda p: p.get("clientId") not in misbehaving_clients)
# In[86]:
newprofile_shutdown_from_bad_clients.map(lambda p: p.get("meta/X-PingSender-Version")).countByValue()
# In[87]:
newprofile_shutdown_from_good_clients.map(lambda p: p.get("meta/X-PingSender-Version")).countByValue()
# Sending `new-profile` pings with reason `shutdown` without the pingsender happens about half of the time for misbehaving clients and about 10% of the time for well-behaved clients. The pingsender doesn't seem to be the issue here: if we generate a ping at shutdown, try to send it with the pingsender, and fail, it's normal for Firefox to pick it back up and send it later, as long as we don't generate a new, different `new-profile` ping for the same client. Well-behaved clients don't do that, so the 10% should be safe.
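# The comparison above can be summarized numerically. A small sketch (the helper and the counts below are hypothetical): given `countByValue()`-style dicts keyed by the `X-PingSender-Version` header, where `None` means the header was missing, compute the share of shutdown pings sent without the pingsender.

```python
def missing_pingsender_pct(version_counts):
    # None means the X-PingSender-Version header was absent from the request.
    total = sum(version_counts.values())
    return 100.0 * version_counts.get(None, 0) / total

# Illustrative numbers only, not the real counts from the cells above.
bad_counts = {None: 50, "1.0": 50}
good_counts = {None: 10, "1.0": 90}
```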
# ### Does the profileCreationDate match the date we received the pings?
# In[88]:
def datetime_from_daysepoch(days_from_epoch):
    return datetime(1970, 1, 1, 0, 0) + timedelta(days=days_from_epoch)

def datetime_from_nanosepoch(nanos_from_epoch):
    return datetime.fromtimestamp(nanos_from_epoch / 1000.0 / 1000.0 / 1000.0)

def get_times(p):
    profile_creation = datetime_from_daysepoch(p["environment/profile/creationDate"]) \
        if p["environment/profile/creationDate"] else None
    ping_creation = datetime_from_nanosepoch(p["meta/creationTimestamp"])
    ping_recv = datetime_from_nanosepoch(p["meta/Timestamp"])
    return (p["id"], profile_creation, ping_creation, ping_recv)
ping_times = subset.map(get_times)
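# A quick sanity check of the days-since-epoch conversion used above (the helper is restated here so the snippet is self-contained): day 0 maps to the Unix epoch, and day arithmetic is plain calendar addition.

```python
from datetime import datetime, timedelta

def days_to_datetime(days_from_epoch):
    # Same conversion as datetime_from_daysepoch() above.
    return datetime(1970, 1, 1, 0, 0) + timedelta(days=days_from_epoch)
```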
# In[89]:
ping_creation_delay_days = ping_times.filter(lambda p: p[1] is not None) \
                                     .map(lambda p: abs((p[1].date() - p[2].date()).days)).collect()
# In[90]:
plt.title("The distribution of the days between the profile creationDate and the 'new-profile' ping creation date")
plt.xlabel("Difference in days")
plt.ylabel("Frequency")
CLIP_DAY = 30
plt.xticks(range(0, CLIP_DAY + 1, 1))
plt.hist(np.clip(ping_creation_delay_days, 0, CLIP_DAY),
         alpha=0.5, bins=50, label="Delays")
# In[91]:
np.percentile(np.array(ping_creation_delay_days), [50, 70, 80, 95, 99])
# The plot shows that most of the creation dates for `new-profile` pings match exactly with the date reported in the environment, `creationDate`. That's good, as this ping should be created very close to the profile creation. The percentile computation confirms that's true for 70% of the `new-profile` pings.
#
# However, things get tricky. The `new-profile` ping was enabled on the 30th of May, 2017 and should only be sent by new profiles. A delay longer than 7 days suggests that either the profile `creationDate` is terribly wrong or that we're sending the `new-profile` ping from existing profiles as well.
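# The one-week threshold from the paragraph above can be checked with a small helper (hypothetical, plain Python; the real delays come from `ping_creation_delay_days`):

```python
def pct_over_week(delays_days, threshold=7):
    # Share of profiles whose ping was created more than `threshold` days
    # after the reported profile creation date.
    over = sum(1 for d in delays_days if d > threshold)
    return 100.0 * over / len(delays_days)

# Made-up delays, in days, for illustration.
sample_delays = [0, 0, 1, 3, 10, 400]
```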
#
# ### Cross-check the `new-profile` and `main` pings
# In[92]:
main_pings = Dataset.from_source("telemetry") \
    .where(docType='main') \
    .where(appUpdateChannel="nightly") \
    .where(submissionDate=lambda x: "20170531" <= x < "20170607") \
    .where(appBuildId=lambda x: "20170531" <= x < "20170607") \
    .records(sc, sample=1.0)
# In[93]:
main_subset = get_pings_properties(main_pings, ["id",
                                                "meta/creationTimestamp",
                                                "meta/Date",
                                                "meta/Timestamp",
                                                "meta/X-PingSender-Version",
                                                "clientId",
                                                "environment/profile/creationDate",
                                                "payload/info/reason",
                                                "payload/info/sessionLength",
                                                "payload/info/subsessionLength",
                                                "payload/info/profileSubsessionCounter",
                                                "payload/info/previousSessionId"])
# Dedupe by document id and restrict the `main` ping data to the pings from the misbehaving and well behaving clients.
# In[94]:
well_behaving_clients = set(subset.filter(lambda p: p.get("clientId") not in misbehaving_clients).map(lambda p: p.get("clientId")).collect())
all_clients = misbehaving_clients + list(well_behaving_clients)
# In[95]:
main_deduped = dedupe(main_subset.filter(lambda p: p.get("clientId") in all_clients), "id")
main_deduped_count = main_deduped.count()
# Try to pair each `new-profile` ping with reason `shutdown` to the very first `main` ping with reason `shutdown` received from that client, to make sure that the former were sent at the right time.
# In[96]:
first_main = main_deduped.filter(lambda p: p.get("payload/info/previousSessionId") is None and
                                           p.get("payload/info/reason") == "shutdown")
# In[97]:
newping_shutdown = deduped_docid.filter(lambda p: p.get("payload/reason") == "shutdown")
# In[98]:
newprofile_plus_main = first_main.union(newping_shutdown)
sorted_per_client = newprofile_plus_main.map(lambda p: (p["clientId"], [(p, p["meta/creationTimestamp"])])) \
                                        .reduceByKey(lambda a, b: sorted(a + b, key=lambda k: k[1])) \
                                        .filter(lambda p: len(p[1]) > 1) \
                                        .map(lambda p: (p[0], [r[0] for r in p[1]]))
num_analysed_clients = sorted_per_client.count()
# In[99]:
HALF_HOUR_IN_S = 30 * 60

def is_newprofile(p):
    # The 'main' ping has the reason field in 'payload/info/reason' instead.
    return "payload/reason" in p and p.get("payload/reason") in ["startup", "shutdown"]

def validate_newprofile_scheduling(client_data):
    ordered_pings = client_data[1]
    # If there's a new-profile ping with reason 'shutdown', look for the closest next
    # 'main' ping with reason shutdown.
    for i, p in enumerate(ordered_pings):
        # Skip until we find the 'new-profile' ping.
        if not is_newprofile(p):
            continue

        # We found the 'new-profile' ping. Do we have any other ping
        # after this?
        next_index = i + 1
        if next_index >= len(ordered_pings):
            return ("No more pings after the 'new-profile'", 1)

        # Is the next ping a 'main' ping with reason 'shutdown'?
        next_ping = ordered_pings[next_index]
        if is_newprofile(next_ping):
            return ("Duplicate 'new-profile' ping.", 1)

        if next_ping.get("payload/info/reason") == "shutdown":
            if next_ping.get("payload/info/sessionLength") <= HALF_HOUR_IN_S:
                return ("The 'new-profile' ping was correctly scheduled", 1)
            return ("The 'new-profile' ping was scheduled at the wrong time", 1)

    return ("No 'shutdown' new-profile ping found", 1)

scheduling_error_counts = sorted_per_client.map(validate_newprofile_scheduling).countByKey()
# In[100]:
for error, count in scheduling_error_counts.items():
    print("{}:\t{}".format(error, pct(count, num_analysed_clients)))
# Most of the `new-profile` pings sent at `shutdown`, 98.37%, were correctly generated because the session lasted less than 30 minutes. Only 0.88% were scheduled at the wrong time. The rest of the clients either sent the `new-profile` at startup or we're still waiting for their `main` ping with reason `shutdown`.
# In[ ]: