@chutten
Created February 23, 2017 21:12
problematic_client
# coding: utf-8
---
title: One Problematic Aurora 51 Client
authors:
- chutten
tags:
- aurora
- firefox
created_at: 2017-02-22
updated_at: 2017-02-23
tldr: Taking a look at one problematic client on Aurora leads to a broad examination of the types of hosts that are sending us this data and some seriously-speculative conclusions.
---
# ## One Problematic Aurora 51 Client
# ### Motivation
# There is one particular client, whose `client_id` I've obscured, that seems to be sending orders of magnitude more "main" pings per day than is expected, or even possible.
#
# I'm interested in figuring out what we can determine about this particular client to see if there are signifiers we can use to identify this anomalous use case. This identification would permit us to:
# * filter these clients' data out of derived datasets where it isn't relevant
# * identify exceptional use-cases for Firefox we don't currently understand
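#
# As a rough illustration of the first goal, here is a minimal sketch of flagging clients by daily ping volume. It is plain Python over `(client_id, submission_date)` pairs rather than the Spark pipeline used below, and both the helper name `anomalous_clients` and the threshold value are assumptions made for the example, not an established rule.

```python
from collections import Counter

# Hypothetical cut-off: more "main" pings than this from one client in one day
# is treated as anomalous. The number is an assumption for illustration only.
MAX_PLAUSIBLE_PINGS_PER_DAY = 1000


def anomalous_clients(records, threshold=MAX_PLAUSIBLE_PINGS_PER_DAY):
    """Return the set of client ids that exceed `threshold` pings on any one day.

    `records` is an iterable of (client_id, submission_date) pairs.
    """
    # Count pings per (client, day) pair.
    per_client_day = Counter(records)
    return {client for (client, _day), n in per_client_day.items() if n > threshold}
```

# Client ids returned by such a helper could then be excluded when building a derived dataset.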
# ### How many pings are we talking, here?
# In[1]:
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
from moztelemetry.dataset import Dataset
from moztelemetry import get_pings_properties, get_one_ping_per_client
# In[2]:
all_pings = (
    Dataset.from_source("telemetry")
    .where(docType='main')
    .where(appBuildId=lambda x: x.startswith("20161014"))
    .where(appUpdateChannel="aurora")
    .records(sc, sample=1)
)
# In[5]:
pings = all_pings.filter(lambda p: p['clientId'] == '<omitted>')
# In[6]:
submission_dates = get_pings_properties(pings, ["meta/submissionDate"])
# In[7]:
from datetime import datetime
ping_counts = submission_dates.map(lambda p: (datetime.strptime(p["meta/submissionDate"], '%Y%m%d'), 1)).countByKey()
# In[8]:
from datetime import timedelta
# In[9]:
# list(...) keeps this working when items() returns a view rather than a list
df = pd.DataFrame(list(ping_counts.items()), columns=["date", "count"]).set_index(["date"])
df.plot(figsize=(17, 7))
plt.xticks(np.arange(min(df.index), max(df.index) + timedelta(3), 3, dtype="datetime64[D]"))
plt.ylabel("ping count")
plt.xlabel("date")
plt.grid(True)
plt.show()
# Just about 100k main pings submitted by this client on a single day? (Feb 16)... that is one active client.
#
# Or _many_ active clients.
# ### What Can We Learn About These Pings?
#
# Well, since these pings all share the same clientId, they likely are sharing user profiles. This means things like profile `creationDate` and so forth won't change amongst them.
#
# However, here's a list of things that might change in interesting ways or otherwise shed some light on the purpose of these installs.
# In[13]:
subset = get_pings_properties(pings, [
"meta/documentId",
"meta/submissionDate",
"meta/geoCountry",
"meta/geoCity",
"environment/addons/activeAddons",
"environment/settings/isDefaultBrowser",
"environment/system/cpu/speedMHz",
"environment/system/os/name",
"environment/system/os/version",
"payload/info/sessionLength",
"payload/info/subsessionLength",
])
# In[14]:
subset.count()
# #### Non-System Addons
# In[18]:
# Count, per addon name, the pings that report it; system addons are skipped
pings_with_addon = (
    subset
    .flatMap(lambda p: [
        (addon["name"], 1)
        for addon in p["environment/addons/activeAddons"].values()
        if not addon.get("isSystem")
    ])
    .countByKey()
)
# In[19]:
sorted(pings_with_addon.items(), key=lambda x: x[1], reverse=True)[:5]
# Nearly every single ping is reporting that it has an addon called 'Random Agent Spoofer'. Interesting.
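#
# For readers without a Spark context, the same tally can be sketched in plain Python. This assumes each ping is a dict shaped like the output of `get_pings_properties` above. The helper name `count_non_system_addons` is hypothetical, and unlike the Spark version it tolerates pings whose addon list is missing.

```python
from collections import Counter


def count_non_system_addons(pings):
    """Count how many pings report each non-system addon, keyed by addon name."""
    counts = Counter()
    for ping in pings:
        # Treat a missing or None addon map as "no addons" (an added safeguard).
        addons = ping.get("environment/addons/activeAddons") or {}
        for addon in addons.values():
            # An addon counts as a system addon only if isSystem is present and truthy.
            if not addon.get("isSystem", False):
                counts[addon["name"]] += 1
    return counts
```

# `count_non_system_addons(pings).most_common(5)` then gives the same kind of top-five list as the sorted call above.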
# #### Session Lengths
# In[20]:
SESSION_MAX = 400
# In[21]:
session_lengths = subset.map(lambda p: p["payload/info/sessionLength"] if p["payload/info/sessionLength"] < SESSION_MAX else SESSION_MAX).collect()
# In[22]:
pd.Series(session_lengths).hist(bins=250, figsize=(17, 7))
plt.ylabel("ping count")
plt.xlabel("session length in seconds")
plt.show()
# In[23]:
pd.Series(session_lengths).value_counts()[:10]
# The session lengths for over half of all the reported pings are exactly 215 seconds long. Three minutes and 35 seconds.
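#
# A claim like "over half of the pings share one value" can be checked directly. The sketch below is plain Python on a list of values such as `session_lengths`; the helper name `modal_share` is an assumption introduced for illustration.

```python
from collections import Counter


def modal_share(values):
    """Return (most_common_value, fraction_of_values_equal_to_it)."""
    value, count = Counter(values).most_common(1)[0]
    # float() keeps the division exact under both Python 2 and Python 3.
    return value, count / float(len(values))
```

# A returned fraction above 0.5 means a single value accounts for more than half of the pings.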
# #### Is this Firefox even the default browser?
# In[24]:
subset.map(lambda p: (p["environment/settings/isDefaultBrowser"], 1)).countByKey()
# No.
# #### CPU speed
# In[25]:
MHZ_MAX = 5000
# In[26]:
mhzes = subset.map(lambda p: p["environment/system/cpu/speedMHz"] if p["environment/system/cpu/speedMHz"] < MHZ_MAX else MHZ_MAX).collect()
# In[27]:
ds = pd.Series(mhzes)
ds.hist(bins=250, figsize=(17, 7))
plt.ylabel("ping count (log)")
plt.xlabel("speed in MHz")
plt.yscale("log")
plt.show()
# In[28]:
pd.Series(mhzes).value_counts()[:10]
# There seems to be a family gathering of different hardware configurations this client is running on, most of them on one particular approximately-3.5GHz machine.
# #### Operating System
# In[29]:
def major_minor(version_string):
    # Keep only the first two dot-separated components, e.g. "5.1.2600" -> "5.1"
    return version_string.split('.')[0] + '.' + version_string.split('.')[1]
# In[30]:
pings_per_os = (
    subset
    .map(lambda p: (p["environment/system/os/name"] + " " + major_minor(p["environment/system/os/version"]), 1))
    .countByKey()
)
# In[31]:
print(len(pings_per_os))
sorted(pings_per_os.items(), key=lambda x: x[1], reverse=True)[:10]
# All of the pings come from Windows XP.
# #### Physical Location (geo-ip of submitting host)
# In[32]:
pings_per_city = (
    subset
    .map(lambda p: (p["meta/geoCountry"] + " " + p["meta/geoCity"], 1))
    .countByKey()
)
# In[33]:
print(len(pings_per_city))
sorted(pings_per_city.items(), key=lambda x: x[1], reverse=True)[:10]
# These pings are coming from all over the world, mostly from countries where Firefox user share is already decent. This may just be a map of browser use across the world's population, which would be consistent with a profile that inhabits a fixed percentage of the browser-using population's computers.
# #### Document IDs
# In[34]:
pings_per_docid = (
    subset
    .map(lambda p: (p["meta/documentId"], 1))
    .countByKey()
)
# In[35]:
print(len(pings_per_docid))
sorted(pings_per_docid.items(), key=lambda x: x[1], reverse=True)[:10]
# That same half of pings reporting 215s sessions? The same ping. Maybe a ping stored in the profile directory before it was distributed? But then how are we seeing hundreds-of-thousands of copies of seven others, too...
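#
# Since so many of these pings are copies of the same document, one way to keep them from inflating counts is to deduplicate on `meta/documentId`. This is a minimal plain-Python sketch, assuming pings are dicts keyed as in `subset`; the helper name `dedupe_by_document_id` is hypothetical and this is not necessarily how any production pipeline handles duplicates.

```python
def dedupe_by_document_id(pings):
    """Keep the first ping seen for each documentId, preserving input order."""
    seen = set()
    unique = []
    for ping in pings:
        doc_id = ping["meta/documentId"]
        if doc_id not in seen:
            seen.add(doc_id)
            unique.append(ping)
    return unique
```

# Comparing `len(pings)` before and after deduplication gives a quick measure of how much of the volume is repeated documents.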
# ### Conclusion
#
# None of this is concrete, but if I were invited to speculate, I'd think there's some non-Mozilla code someplace that has embedded a particular (out-of-date) version of Firefox Developer Edition into itself, automating it to perform a 3-minute-and-35-second task on Windows XP machines, possibly while masquerading as something completely different (using the addon).
#
# This could be legitimate. Firefox contains a robust networking and rendering stack, so it might be desirable to embed it within, say, a video game as a fully-featured embedded browser. The user-agent-spoofing addon could very well be used to set a custom user agent to identify the video game's browser, and of course it wouldn't be the user's default browser.
#
# However, I can't so easily explain this client's broad geographical presence and Windows XP focus.