The data for the Cliqz TxP experiment is spread across multiple data sources:
- `testpilot` pings are submitted when the Cliqz add-on is enabled or disabled. Note: `enabled` pings are known to resubmit when the add-on gets updated.
- `testpilottest` pings report when the user made in-content searches, as well as when the Cliqz add-on is installed/uninstalled/enabled/disabled. They also include a record of the Cliqz client ID, which will be needed to join our data to the search data collected by Cliqz.
- `main` pings record the usual activity stats.
- All search data collected by the Cliqz add-on is managed by Cliqz. They are providing a daily updated CSV containing their complete data for profiles in the test.
The idea is to combine the data into multiple tables (Spark DataFrames) which will serve all of our analysis needs. Some will be pushed to STMO to provide access to everyone working on the Cliqz project.
- Cliqz uses their own client ID, distinct from the UT `clientId`. The search data is keyed by Cliqz client ID. The link is provided in the `testpilottest` pings, where the `cliqzSession` field lists the Cliqz client ID.
- Cliqz IDs are reported encrypted, and they are decrypted during server-side processing. The ID showing up in the `cliqzSession` field is the encrypted version, whereas the one used in the search dataset is the decrypted version.
- The Cliqz client ID string, appearing both in the search table and in the decrypted `cliqzSession` fields, is actually a string of the form `<clientID>|<datestamp>|<channelID>`. Only the first portion of this string (before the first `|`) is of interest to us.
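As a minimal sketch of the last point, the client ID portion can be recovered by splitting on the first `|`. The function name `extract_cliqz_client_id` and the sample value are hypothetical, and the input is assumed to be the already decrypted string (the decryption step itself is not covered here).

```python
def extract_cliqz_client_id(decrypted_session):
    """Return the part of a decrypted Cliqz ID string before the first '|'.

    Assumes the '<clientID>|<datestamp>|<channelID>' form described above.
    Returns None when the input is missing.
    """
    if decrypted_session is None:
        return None
    # partition() splits on the first separator only, and also copes with
    # strings that contain no '|' at all (the whole string is returned).
    client_id, _, _ = decrypted_session.partition("|")
    return client_id


# Made-up value, for illustration only.
print(extract_cliqz_client_id("abc123|17000|40"))
```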
These are base tables containing raw Test Pilot data for participants in the Cliqz study. The point of these is never to have to return to the raw pings. See this gist for some examples of raw TxP data, including some code for identifying pings related to Cliqz.
Table for `testpilot` pings, 1 row per ping.
Columns:
- `client_id`: `ping["clientId"]`
- `date`: `ping["meta"]["submissionDate"]`
- `client_timestamp`: `ping["creationDate"]`
- `geo`: `ping["meta"]["geoCountry"]`
- `locale`: `ping["environment"]["settings"]["locale"]`
- `channel`: `ping["meta"]["normalizedChannel"]`
- `os`: `ping["meta"]["os"]`
- `telemetry_enabled`: `ping["environment"]["settings"]["telemetryEnabled"]`
- `has_addon`: `"testpilot@cliqz.com" in ping["environment"]["addons"]["activeAddons"].keys()`
- `cliqz_version`: `ping["environment"]["addons"]["activeAddons"]["testpilot@cliqz.com"]["version"]`
- `event`: `ping["payload"]["events"][0]["event"]`
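The column definitions above could be applied to a single raw ping roughly as follows. This is a sketch, assuming each ping has already been parsed into a nested Python dict; the function name `flatten_testpilot_ping` is hypothetical, and returning `None` for missing fields is a choice made here rather than something the table spec prescribes.

```python
CLIQZ_ADDON_ID = "testpilot@cliqz.com"


def flatten_testpilot_ping(ping):
    """Flatten one raw `testpilot` ping (nested dict) into a table row."""
    meta = ping.get("meta", {})
    env = ping.get("environment", {})
    settings = env.get("settings", {})
    active_addons = env.get("addons", {}).get("activeAddons", {})
    events = ping.get("payload", {}).get("events", [])
    return {
        "client_id": ping.get("clientId"),
        "date": meta.get("submissionDate"),
        "client_timestamp": ping.get("creationDate"),
        "geo": meta.get("geoCountry"),
        "locale": settings.get("locale"),
        "channel": meta.get("normalizedChannel"),
        "os": meta.get("os"),
        "telemetry_enabled": settings.get("telemetryEnabled"),
        "has_addon": CLIQZ_ADDON_ID in active_addons,
        # The version can only be read when the add-on is active.
        "cliqz_version": active_addons.get(CLIQZ_ADDON_ID, {}).get("version"),
        # Only the first event in the payload is recorded, as in the spec.
        "event": events[0].get("event") if events else None,
    }
```

Using `.get()` throughout means a ping with a missing section yields a row with `None` values instead of raising, which tends to be convenient when mapping over a large collection of pings.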
Table for `testpilottest` pings, 1 row per ping.
Columns:
- `client_id`: `ping["clientId"]`
- `cliqz_client_id`: `ping["payload"]["payload"]["cliqzSession"]` (encrypted version)
- `session_id`: `ping["payload"]["payload"]["sessionId"]`
- `subsession_id`: `ping["payload"]["payload"]["subsessionId"]`
- `date`: `ping["meta"]["submissionDate"]`
- `client_timestamp`: `ping["creationDate"]`
- `geo`: `ping["meta"]["geoCountry"]`
- `locale`: `ping["environment"]["settings"]["locale"]`
- `channel`: `ping["meta"]["normalizedChannel"]`
- `os`: `ping["meta"]["os"]`
- `telemetry_enabled`: `ping["environment"]["settings"]["telemetryEnabled"]`
- `has_addon`: `"testpilot@cliqz.com" in ping["environment"]["addons"]["activeAddons"].keys()`
- `cliqz_version`: `ping["environment"]["addons"]["activeAddons"]["testpilot@cliqz.com"]["version"]`
- `event`: `ping["payload"]["payload"]["event"]`
- `content_search_engine`: `ping["payload"]["payload"]["contentSearch"]` (note that this is only populated when the event is `userVisitedEngineHost` or `userVisitedEngineResult`)
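The fields specific to `testpilottest` all live under the nested `payload.payload` object. The sketch below extracts just those, assuming the ping is a nested Python dict; the function name `flatten_testpilottest_payload` is hypothetical. Forcing `content_search_engine` to `None` for other event types is an assumption added here to make the note about population explicit, not a documented requirement.

```python
# Events for which `contentSearch` is expected to be populated.
CONTENT_SEARCH_EVENTS = {"userVisitedEngineHost", "userVisitedEngineResult"}


def flatten_testpilottest_payload(ping):
    """Extract the `testpilottest`-specific columns from one raw ping."""
    payload = ping.get("payload", {}).get("payload", {})
    event = payload.get("event")
    return {
        # Still the encrypted ID at this stage.
        "cliqz_client_id": payload.get("cliqzSession"),
        "session_id": payload.get("sessionId"),
        "subsession_id": payload.get("subsessionId"),
        "event": event,
        "content_search_engine": (
            payload.get("contentSearch")
            if event in CONTENT_SEARCH_EVENTS
            else None
        ),
    }
```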
This table summarizes daily activity for profiles in the Cliqz experiment, combining data from `main_summary` and TxP.
It includes data for all clients that have submitted `testpilot` or `testpilottest` pings, starting from two weeks before their earliest submission date.
The table has 1 row per (clientID, date).
Steps to determine the set of clients:
- `select client_id, min(date) as min_date from <union of both TxP tables> where locale = 'de' and geo = 'DE' and has_addon is true group by client_id`
- Collect the rows of `main_summary` for those clientIDs where `submission_date >= min_date - 2 weeks`.
- Aggregate stats by client/submission date.
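The first two steps can be sketched in plain Python as below. The real tables are Spark DataFrames, so this only illustrates the logic on lists of dicts; the function names `earliest_txp_dates` and `select_main_summary_rows` are hypothetical, and dates are assumed to have been parsed into `datetime.date` objects.

```python
from datetime import date, timedelta


def earliest_txp_dates(txp_rows):
    """Map client_id -> earliest date over qualifying rows of both TxP tables."""
    min_dates = {}
    for row in txp_rows:
        if row["locale"] == "de" and row["geo"] == "DE" and row["has_addon"]:
            cid = row["client_id"]
            if cid not in min_dates or row["date"] < min_dates[cid]:
                min_dates[cid] = row["date"]
    return min_dates


def select_main_summary_rows(main_rows, min_dates, lookback=timedelta(weeks=2)):
    """Keep main_summary rows for known clients, from `lookback` before their first TxP date."""
    return [
        row for row in main_rows
        if row["client_id"] in min_dates
        and row["submission_date"] >= min_dates[row["client_id"]] - lookback
    ]


# Made-up rows, for illustration only.
txp = [{"client_id": "a", "date": date(2017, 1, 15),
        "locale": "de", "geo": "DE", "has_addon": True}]
main = [{"client_id": "a", "submission_date": date(2017, 1, 1)},
        {"client_id": "a", "submission_date": date(2016, 12, 31)}]
print(select_main_summary_rows(main, earliest_txp_dates(txp)))
```

In this example only the row dated 2017-01-01 is kept, since it falls exactly two weeks before the client's earliest TxP date.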
Columns:
- `client_id`
- `cliqz_client_id`: the decrypted Cliqz client ID, i.e. the result of applying the decryption algorithm to `ping["payload"]["payload"]["cliqzSession"]`, and only retaining the portion before the first `|`
- `date`: `row["submission_date"]`
- `has_cliqz`: `"testpilot@cliqz.com"` is in the set of add-on IDs
- `cliqz_version`: `ping["environment"]["addons"]["activeAddons"]["testpilot@cliqz.com"]["version"]`
- `channel`: `row["normalized_channel"]` (maybe just pick the first on that date)
- `os`: `row["os"]` (first on that date)
- `is_default_browser`: `row["is_default_browser"]` (use the most commonly reported value on the date)
- `session_hours`: `sum(row["subsession_length"]) / 3600`
- `search_default`: `row["default_search_engine"]` (last on the date)
- `page_views`: `sum(row["total_uri_count"])` (would be nice to have; it is listed in the MainSummary doc but does not seem to appear in the dataset on Spark)
- `search_counts`: same structure as in `main_summary`, with `count` summed by `engine` and `source`
- `cliqz_enabled`: total count of `enabled` events from TxP for the profile on that date
- `cliqz_disabled`: total count of `disabled` events from TxP for the profile on that date
- `test_enabled`: total count of `cliqzEnabled` events from TxP (`testpilottest`) for the profile on that date
- `test_disabled`: total count of `cliqzDisabled` events from TxP (`testpilottest`) for the profile on that date
- `test_installed`: total count of `cliqzInstalled` events from TxP (`testpilottest`) for the profile on that date
- `test_uninstalled`: total count of `cliqzUninstalled` events from TxP (`testpilottest`) for the profile on that date
- `content_search`: total counts of content searches from `testpilottest` on that date (a Map keyed by engine)
- `content_search_result`: total counts of content search results from `testpilottest` on that date (a Map keyed by engine)
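To make the per-day aggregation rules concrete, here is a sketch in plain Python for a few of the columns. It assumes the rows passed in all belong to one (client, date) and are already sorted chronologically, and that `subsession_length` is a number of seconds; the function names `aggregate_day` and `count_txp_test_events` are hypothetical. The real job would express the same logic as Spark aggregations.

```python
from collections import Counter, defaultdict

# testpilottest event name -> daily count column.
EVENT_COLUMNS = {
    "cliqzEnabled": "test_enabled",
    "cliqzDisabled": "test_disabled",
    "cliqzInstalled": "test_installed",
    "cliqzUninstalled": "test_uninstalled",
}


def aggregate_day(main_rows):
    """Aggregate the main_summary rows of ONE (client, date), in time order."""
    default_votes = Counter(r["is_default_browser"] for r in main_rows)
    return {
        "channel": main_rows[0]["normalized_channel"],          # first on the date
        "os": main_rows[0]["os"],                               # first on the date
        "is_default_browser": default_votes.most_common(1)[0][0],  # most common
        "session_hours": sum(r["subsession_length"] for r in main_rows) / 3600,
        "search_default": main_rows[-1]["default_search_engine"],  # last on the date
    }


def count_txp_test_events(testpilottest_rows):
    """Count install/enable events per (client_id, date)."""
    counts = defaultdict(Counter)
    for row in testpilottest_rows:
        column = EVENT_COLUMNS.get(row["event"])
        if column is not None:
            counts[(row["client_id"], row["date"])][column] += 1
    return counts
```

A `Counter` returns 0 for columns that never occurred, so days without a given event naturally get a zero count when the two results are joined.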
All data involving search will be pulled in from a dataset provided by Cliqz, which is available at `s3://net-mozaws-prod-cliqz/`. The data is in a single CSV, which contains all the data for TxP Cliqz participants running back to the beginning of the experiment, and is updated daily.
The Cliqz search dataset contains 1 row per search event (per client/date). See this Rmd for more details on its contents.
We are considering renaming the columns for clarity as follows:
- `cliqz_client_id`: `udid` (decrypted Cliqz client ID; the actual client ID is everything before the first `|`)
- `date`: `start_time`
- `access_point`: `entry_point` (the search access point)
- `action`: `selection_type` (the action taken: search using default search engine, visit to Cliqz result, URL visit)
- `selected_result_type`: `selection_source` (the type of result that was selected from the dropdown bar, using Cliqz's coding scheme)
- `smartcliqz_type`: `selection_class` (the type of SmartCliqz rich card that was shown, if any)
- `element_selected`: `selection_element` (an indication of what part of the Cliqz cards was selected; not sure what the coding scheme is)
- `were_browser_results_shown`: `final_result_list_contains_history` (whether any results from `places.db` were shown, including bookmarks, history, tabs)
- `num_cliqz_results_shown`: `final_result_list_backend_result_count` (number of results shown originating from the Cliqz backend)
- `selected_result_index`: `selection_index` (the index of the result selected from the dropdown)
- `entry_length`: `selection_query_length` (the final number of characters entered in the URL bar)
- `final_result_list_show_time`: time to show the dropdown list of results
- `selection_time`: time to make a selection after last character is typed
- `total_signal_count`: total number of Cliqz telemetry events generated by this search
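The renaming could be applied while reading the CSV, along these lines. This is a sketch under several assumptions: the file is comma-delimited with a header row carrying the original Cliqz column names, the left-hand names in the list above are the new ones, and the function name `read_search_rows` is hypothetical. Columns without a rename are passed through unchanged.

```python
import csv
import io

# Original Cliqz column name -> proposed name.
RENAMES = {
    "udid": "cliqz_client_id",
    "start_time": "date",
    "entry_point": "access_point",
    "selection_type": "action",
    "selection_source": "selected_result_type",
    "selection_class": "smartcliqz_type",
    "selection_element": "element_selected",
    "final_result_list_contains_history": "were_browser_results_shown",
    "final_result_list_backend_result_count": "num_cliqz_results_shown",
    "selection_index": "selected_result_index",
    "selection_query_length": "entry_length",
}


def read_search_rows(csv_file):
    """Yield one dict per search event, using the proposed column names."""
    for raw in csv.DictReader(csv_file):
        row = {RENAMES.get(name, name): value for name, value in raw.items()}
        # Keep only the client ID portion, before the first '|'.
        row["cliqz_client_id"] = row["cliqz_client_id"].partition("|")[0]
        yield row


# Made-up two-line CSV, for illustration only.
sample = io.StringIO("udid,start_time,entry_point\nabc|17000|40,2017-01-05,urlbar\n")
for search_row in read_search_rows(sample):
    print(search_row)
```

Note that `csv.DictReader` returns every value as a string, so numeric columns such as `selection_time` would still need an explicit conversion.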