
@dzeber
Last active February 9, 2017 18:43

Cliqz TxP Data

The data for the Cliqz TxP experiment is spread across multiple data sources:

  • testpilot pings are submitted when the Cliqz add-on is enabled or disabled. Note: enabled pings are known to resubmit when the add-on gets updated.
  • testpilottest pings report when the user makes in-content searches, as well as when the Cliqz add-on is installed/uninstalled/enabled/disabled. They also include a record of the Cliqz client ID, which is needed to join our data to the search data collected by Cliqz.
  • main pings record the usual activity stats
  • all search data collected by the Cliqz add-on is managed by Cliqz. They provide a CSV, updated daily, containing their complete data for profiles in the test.

Plan

The idea is to combine the data into multiple tables (Spark DataFrames) which will serve all of our analysis needs. Some will be pushed to STMO to provide access to everyone working on the Cliqz project.

Notes on Cliqz client IDs

  • Cliqz uses their own client ID, distinct from the UT clientId. The search data is keyed by Cliqz client ID. The link is provided in the testpilottest pings, where the cliqzSession field lists the Cliqz clientID.
  • Cliqz IDs are reported encrypted, and they are decrypted during server-side processing. The ID showing up in the cliqzSession field is the encrypted version, whereas the one used in the search dataset is the decrypted version.
  • The Cliqz client ID string, appearing both in the search table and in the decrypted cliqzSession field, is actually a string of the form <clientID>|<datestamp>|<channelID>. Only the first portion of this string (before the first |) is of interest to us.
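
As a minimal sketch of the last point, the client ID portion can be recovered by splitting on the first |. The function name extract_cliqz_client_id and the sample value are hypothetical, and the input is assumed to be already decrypted.

```python
def extract_cliqz_client_id(cliqz_id_string):
    """Return the part of a decrypted Cliqz ID string before the first '|'.

    Assumes the input has the form <clientID>|<datestamp>|<channelID>.
    A string containing no '|' is returned unchanged.
    """
    return cliqz_id_string.split("|", 1)[0]


# Made-up value for illustration only; real IDs will look different.
example = "abc123|17000|40"
client_id = extract_cliqz_client_id(example)
```

The same function would apply to both the udid column of the search dataset and the decrypted cliqzSession value, so the two sides of the join are normalized identically.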

Test pilot data

These are base tables containing raw Test Pilot data for participants in the Cliqz study. The point of these is never to have to return to the raw pings. See this gist for some examples of raw TxP data, including some code for identifying pings related to Cliqz.

Table for testpilot pings, 1 row per ping.

Columns:

  • client_id: ping["clientId"]
  • date: ping["meta"]["submissionDate"]
  • client_timestamp: ping["creationDate"]
  • geo: ping["meta"]["geoCountry"]
  • locale: ping["environment"]["settings"]["locale"]
  • channel: ping["meta"]["normalizedChannel"]
  • os: ping["meta"]["os"]
  • telemetry_enabled: ping["environment"]["settings"]["telemetryEnabled"]
  • has_addon: "testpilot@cliqz.com" in ping["environment"]["addons"]["activeAddons"].keys()
  • cliqz_version: ping["environment"]["addons"]["activeAddons"]["testpilot@cliqz.com"]["version"]
  • event: ping["payload"]["events"][0]["event"]
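
The column definitions above could be applied to a single ping roughly as follows. This is a sketch that assumes each ping is available as a nested Python dict; the function name flatten_testpilot_ping is hypothetical, and returning None for cliqz_version when the add-on is absent is an assumption, not something the spec above states.

```python
CLIQZ_ADDON_ID = "testpilot@cliqz.com"


def flatten_testpilot_ping(ping):
    """Flatten one raw testpilot ping (nested dict) into a single table row."""
    meta = ping["meta"]
    settings = ping["environment"]["settings"]
    active_addons = ping["environment"]["addons"]["activeAddons"]
    has_addon = CLIQZ_ADDON_ID in active_addons
    return {
        "client_id": ping["clientId"],
        "date": meta["submissionDate"],
        "client_timestamp": ping["creationDate"],
        "geo": meta["geoCountry"],
        "locale": settings["locale"],
        "channel": meta["normalizedChannel"],
        "os": meta["os"],
        "telemetry_enabled": settings["telemetryEnabled"],
        "has_addon": has_addon,
        # Assumption: use None when the add-on is not among the active add-ons.
        "cliqz_version": (
            active_addons[CLIQZ_ADDON_ID]["version"] if has_addon else None
        ),
        "event": ping["payload"]["events"][0]["event"],
    }
```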

Table for testpilottest pings, 1 row per ping.

Columns:

  • client_id: ping["clientId"]
  • cliqz_client_id: ping["payload"]["payload"]["cliqzSession"] (encrypted version)
  • session_id: ping["payload"]["payload"]["sessionId"]
  • subsession_id: ping["payload"]["payload"]["subsessionId"]
  • date: ping["meta"]["submissionDate"]
  • client_timestamp: ping["creationDate"]
  • geo: ping["meta"]["geoCountry"]
  • locale: ping["environment"]["settings"]["locale"]
  • channel: ping["meta"]["normalizedChannel"]
  • os: ping["meta"]["os"]
  • telemetry_enabled: ping["environment"]["settings"]["telemetryEnabled"]
  • has_addon: "testpilot@cliqz.com" in ping["environment"]["addons"]["activeAddons"].keys()
  • cliqz_version: ping["environment"]["addons"]["activeAddons"]["testpilot@cliqz.com"]["version"]
  • event: ping["payload"]["payload"]["event"]
  • content_search_engine: ping["payload"]["payload"]["contentSearch"] (note that this is only populated when the event is userVisitedEngineHost or userVisitedEngineResult)
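
For the fields that are specific to testpilottest pings, a sketch of the extraction might look like this. The function name extract_txp_test_fields is hypothetical, and the columns shared with the testpilot table are omitted. Using .get for contentSearch assumes the key is missing or null when the event is not a content search event.

```python
def extract_txp_test_fields(ping):
    """Pull the testpilottest-specific columns out of one raw ping (nested dict)."""
    inner = ping["payload"]["payload"]
    return {
        # Still the encrypted version at this stage.
        "cliqz_client_id": inner["cliqzSession"],
        "session_id": inner["sessionId"],
        "subsession_id": inner["subsessionId"],
        "event": inner["event"],
        # Assumption: absent or null unless the event is
        # userVisitedEngineHost or userVisitedEngineResult.
        "content_search_engine": inner.get("contentSearch"),
    }
```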

Profile daily table

This table summarizes daily activity for profiles in the Cliqz experiment, combining data from main_summary and TxP. It includes data for all clients that have submitted testpilot or testpilottest pings, starting from two weeks before their earliest submission date. The table has 1 row per (clientID, date).

Steps to determine the set of clients:

  1. select client_id, min(date) as min_date from <union of both TxP tables> where locale = 'de' and geo = 'DE' and has_addon is true group by client_id
  2. Collect the rows of main_summary for those clientIDs where submission_date >= min_date - 2 weeks.
  3. Aggregate stats by client/submission date.
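
The first two steps can be sketched in plain Python as below. This is illustrative only: the function names are hypothetical, rows are assumed to be dicts, and dates are assumed to be YYYYMMDD strings (so that string comparison orders them correctly). A real implementation would more likely run as Spark DataFrame operations.

```python
from datetime import datetime, timedelta

DATE_FORMAT = "%Y%m%d"  # assumed date format for both TxP and main_summary


def earliest_dates(txp_rows):
    """Step 1: earliest TxP date per client, restricted to de/DE with the add-on."""
    min_dates = {}
    for row in txp_rows:
        if row["locale"] == "de" and row["geo"] == "DE" and row["has_addon"]:
            cid = row["client_id"]
            if cid not in min_dates or row["date"] < min_dates[cid]:
                min_dates[cid] = row["date"]
    return min_dates


def window_start(min_date, weeks=2):
    """Date string falling `weeks` weeks before min_date."""
    start = datetime.strptime(min_date, DATE_FORMAT) - timedelta(weeks=weeks)
    return start.strftime(DATE_FORMAT)


def select_main_summary_rows(main_rows, min_dates):
    """Step 2: keep main_summary rows on or after each client's window start."""
    starts = {cid: window_start(d) for cid, d in min_dates.items()}
    return [
        row for row in main_rows
        if row["client_id"] in starts
        and row["submission_date"] >= starts[row["client_id"]]
    ]
```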

Columns:

  • client_id
  • cliqz_client_id: the decrypted Cliqz client ID, i.e. the result of applying the decryption algorithm to ping["payload"]["payload"]["cliqzSession"] and retaining only the portion before the first |
  • date: row["submission_date"]
  • has_cliqz: "testpilot@cliqz.com" is in the set of add-on IDs
  • cliqz_version: ping["environment"]["addons"]["activeAddons"]["testpilot@cliqz.com"]["version"]
  • channel: row["normalized_channel"] (maybe just pick the first on that date)
  • os: row["os"] (first on that date)
  • is_default_browser: row["is_default_browser"] (use the most commonly reported value on the date)
  • session_hours: sum(row["subsession_length"]) / 3600
  • search_default: row["default_search_engine"] (last on the date)
  • page_views: sum(row["total_uri_count"]) --- would be nice to have. It's listed in the MainSummary doc but does not seem to appear in the dataset on Spark.
  • search_counts: same structure as in main_summary, with count summed by engine and source.
  • cliqz_enabled: total count of enabled events from TxP for the profile on that date
  • cliqz_disabled: total count of disabled events from TxP for the profile on that date
  • test_enabled: total count of cliqzEnabled events from TxP (testpilottest) for the profile on that date
  • test_disabled: total count of cliqzDisabled events from TxP (testpilottest) for the profile on that date
  • test_installed: total count of cliqzInstalled events from TxP (testpilottest) for the profile on that date
  • test_uninstalled: total count of cliqzUninstalled events from TxP (testpilottest) for the profile on that date
  • content_search: total counts of content searches from testpilottest on that date (a Map keyed by engine)
  • content_search_result: total counts of content search results from testpilottest on that date (a Map keyed by engine)
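
The per-day aggregation rules for the main_summary columns (first, last, most common, sum) could be sketched as follows. The function name summarize_client_day is hypothetical. The rows are assumed to be dicts for a single client and date, already sorted in submission order, and each search_counts entry is assumed to carry engine, source and count fields. Ties for the most common is_default_browser value go to the value seen first.

```python
from collections import Counter


def summarize_client_day(rows):
    """Collapse the main_summary rows for one (client, date) into one record."""
    search_counts = Counter()
    for row in rows:
        # search_counts may be missing or null on some rows (assumption).
        for entry in row.get("search_counts") or []:
            search_counts[(entry["engine"], entry["source"])] += entry["count"]

    default_votes = Counter(row["is_default_browser"] for row in rows)
    return {
        "channel": rows[0]["normalized_channel"],                  # first on the date
        "os": rows[0]["os"],                                       # first on the date
        "is_default_browser": default_votes.most_common(1)[0][0],  # most common
        "session_hours": sum(r["subsession_length"] for r in rows) / 3600,
        "search_default": rows[-1]["default_search_engine"],       # last on the date
        "search_counts": dict(search_counts),
    }
```

The TxP event counts (cliqz_enabled, test_installed, and so on) would be computed separately from the two TxP tables and joined on (client_id, date).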

Search data

All data involving search will be pulled in from a dataset provided by Cliqz, which is available at s3://net-mozaws-prod-cliqz/. The data is in a single CSV, which contains all the data for TxP Cliqz participants going back to the beginning of the experiment, and is updated daily.

The Cliqz search dataset contains 1 row per search event (per client/date). See this Rmd for more details on its contents.

We are considering renaming the columns for clarity as follows (new name: original name):

  • cliqz_client_id: udid (decrypted Cliqz client ID --- actual client ID is everything before the first |)
  • date: start_time
  • access_point: entry_point (the search access point)
  • action: selection_type (the action taken - search using default search engine, visit to Cliqz result, URL visit)
  • selected_result_type: selection_source (the type of result that was selected from the dropdown bar, using Cliqz's coding scheme)
  • smartcliqz_type: selection_class (the type of SmartCliqz rich card that was shown, if any)
  • element_selected: selection_element (an indication of what part of the Cliqz cards was selected --- not sure what the coding scheme is)
  • were_browser_results_shown: final_result_list_contains_history (whether any results from places.db were shown, including bookmarks, history, tabs)
  • num_cliqz_results_shown: final_result_list_backend_result_count (number of results shown originating from the Cliqz backend)
  • selected_result_index: selection_index (the index of the result selected from the dropdown)
  • entry_length: selection_query_length (the final number of characters entered in the URLbar)
  • final_result_list_show_time: time to show the dropdown list of results
  • selection_time: time to make a selection after last character is typed
  • total_signal_count: total number of Cliqz telemetry events generated by this search
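
A sketch of how the proposed renaming could be applied when reading the CSV. It assumes the file has a header row using the original Cliqz column names; the names COLUMN_RENAMES, rename_columns and read_renamed are hypothetical, and the sample data is made up. Columns without a proposed new name are passed through unchanged.

```python
import csv
import io

# Original Cliqz column name -> proposed new name.
COLUMN_RENAMES = {
    "udid": "cliqz_client_id",
    "start_time": "date",
    "entry_point": "access_point",
    "selection_type": "action",
    "selection_source": "selected_result_type",
    "selection_class": "smartcliqz_type",
    "selection_element": "element_selected",
    "final_result_list_contains_history": "were_browser_results_shown",
    "final_result_list_backend_result_count": "num_cliqz_results_shown",
    "selection_index": "selected_result_index",
    "selection_query_length": "entry_length",
}


def rename_columns(row):
    """Return a copy of a CSV row (dict) with the proposed column names."""
    return {COLUMN_RENAMES.get(name, name): value for name, value in row.items()}


def read_renamed(csv_file):
    """Read every row from an open CSV file object, renaming columns."""
    return [rename_columns(row) for row in csv.DictReader(csv_file)]


# Made-up two-line sample standing in for the real file.
sample = io.StringIO("udid,start_time,total_signal_count\nabc|1|2,20170110,5\n")
renamed_rows = read_renamed(sample)
```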