nickyreinert/aa-export-import.ipynb

## aa-export-import.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Adobe Analytics Import-Export \n",
    "\n",
    "This notebook helps you to set up a little ETL process that connects two Adobe Analytics APIs: The Data Feed (**DF**) that exports raw tracking data and the Bulk Data Insertion API (**BDIA**) that allows you to import data.\n",
    "\n",
    "The goal is to move data from one report suite to another.\n",
    "\n",
    "* [Bulk Data Insertion API Doc](https://developer.adobe.com/analytics-apis/docs/2.0/guides/endpoints/bulk-data-insertion/)\n",
    "* [Data Feed Columns Doc](https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-reference.html?lang=en)\n",
    "\n",
    "## Process\n",
    "(If you want to keep track of the process, I recommend to add a new eVar that you can fill with a particular flag to identify the data later, like `prop42 = \"imported-at-2024-01-01\"`)\n",
    "\n",
    "1. Request the data using Analytics Data Feed interface\n",
    "   1. Set feed interval to \"daily\" and select your date range\n",
    "   2. Select the columns you need, at least those are\n",
    "   3. Activate GZIP compression\n",
    "   4. Activate \"Multiple Files\"\n",
    "   5. Manifest file is not required\n",
    "   6. Choose the files destination\n",
    "2. Once the export is done, you will get an E-Mail\n",
    "3. All **DF** files belong to the sub-folder `input`, this is the folder structure, this process expect one *file-set* per day:\n",
    "\n",
    "      ```\n",
    "      input/01-report-suite-name_2023-01-17.tsv\n",
    "      input/report-suite-name_2023-01-17-lookup_data/event.tsv   \n",
    "      input/report-suite-name_2023-01-17-lookup_data/browser_type.tsv\n",
    "      input/report-suite-name_2023-01-17-lookup_data/{...}.tsv\n",
    "      ```\n",
    "4. The target report suite needs to be configured to accept `timestamps` (this is one requirement of the **BDIA** to import data)\n",
    "   1. Go to \"Report Suite Manager\" and select the target Report Suite\n",
    "   2. Edit Settings > General > Timestamp Configuration\n",
    "   3. Check the box \"Timestamps optional\"\n",
    "\n",
    "5. Follow the notebook, it will result in a compressed file that you can to BDIA, but before that:\n",
    "6. Use the \"Validate\" section to compare input and output data with the actual data in Adobe Analytics\n",
    "8. If your data matches, go to  BDIA (https://adobedocs.github.io/analytics-2.0-apis/) log-in and select the BDIA endpoint\n",
    "9. start with `/events/validate` to check if your file is valid\n",
    "10. finally use the `/events` endpoint to actually process your data\n",
    "11. the field `x-adobe-vgid` points to your target report suite ID\n",
    "12. set `debug` to true and keep the CURL request if you need Adobe assistance afterwards"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Preparation\n",
    "## Libraries and helper functions\n",
    "\n",
    "Nothing special happens here, Explanation of the helper functions will follow later. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import csv\n",
    "import gzip\n",
    "import shutil\n",
    "import sys\n",
    "from io import StringIO\n",
    "\n",
    "pd.set_option('display.max_columns', None)\n",
    "\n",
    "def loadDataFeed(day, report_suite_name, num_rows, encoding):\n",
    "\n",
    "    file_import_columns = open(f\"input/{report_suite_name}_lookup_data/column_headers.tsv\", \"r\")\n",
    "    import_columns = list(csv.reader(file_import_columns, delimiter=\"\\t\"))[0]\n",
    "    file_import_columns.close()\n",
    "    \n",
    "    df = pd.read_csv(f\"input/01-{report_suite_name}_{day}.tsv\", \n",
    "        delimiter='\\t', \n",
    "        nrows=num_rows, \n",
    "        header=None,\n",
    "        names=import_columns,\n",
    "        dtype=str,\n",
    "        # encoding_errors='ignore',\n",
    "        encoding = encoding,\n",
    "        engine=\"c\")\n",
    "\n",
    "    # \"Dask\" performs a little better, switch, if you like\n",
    "    # with open(f\"input/01-{report_suite_name}_{day}.tsv\", encoding=\"utf8\", errors='ignore') as file:\n",
    "    #     csvString = file.read()\n",
    "            \n",
    "    #     csvStringIO = StringIO(csvString)\n",
    "\n",
    "    #     df = dd.read_csv(csvStringIO, \n",
    "    #         delimiter='\\t', \n",
    "    #         header=None,\n",
    "    #         names=import_columns,\n",
    "    #         dtype=str,\n",
    "    #         encoding = encoding,\n",
    "    #         engine=\"c\")\n",
    "\n",
    "    return df\n",
    "\n",
    "def applyLookUpTables(df, report_suite_name, lookup_tables, day, encoding):\n",
    "\n",
    "    for column in lookup_tables:\n",
    "\n",
    "        print(f'\\t\\tmapping {column}')\n",
    "        lookup_table_file = lookup_tables[column]\n",
    "\n",
    "        df_lookup_table = pd.read_csv(f\"input/{report_suite_name}_{day}-lookup_data/{lookup_table_file}\", \n",
    "            delimiter='\\t', \n",
    "            header=None,\n",
    "            encoding = encoding,\n",
    "            encoding_errors='ignore',\n",
    "            engine=\"python\",\n",
    "            on_bad_lines = 'skip')\n",
    "\n",
    "        lookup_table = pd.Series(df_lookup_table[1].values, index = df_lookup_table[0].astype(str)).to_dict()\n",
    "\n",
    "        df[column] = df[column].map(lookup_table)\n",
    "        \n",
    "    return df  \n",
    "\n",
    "def mapEventIds(df, report_suite_name, day, encoding):\n",
    "\n",
    "    file = open(f\"input/{report_suite_name}_{day}-lookup_data/event.tsv\", \"r\", \n",
    "        encoding=encoding)\n",
    "    lookup_table_all = dict(csv.reader(file, delimiter=\"\\t\"))\n",
    "    file.close()\n",
    "\n",
    "    # filter the lookup table to focus on custom events only\n",
    "    lookup_table = {event_key : lookup_table_all[event_key].replace('Custom Event ', '') for event_key in lookup_table_all if 'Custom Event ' in lookup_table_all[event_key]}\n",
    "\n",
    "    def map_event_ids(event_ids):\n",
    "        \n",
    "        # only process, if there are events\n",
    "        if event_ids != None and str(event_ids) != 'nan':\n",
    "\n",
    "            event_ids_mapped = []\n",
    "            # split to get a list of events from the source column\n",
    "            event_ids = str(event_ids).split(',')\n",
    "            for event_id in event_ids:\n",
    "\n",
    "                # check if an increment event or counter event is present\n",
    "                if event_id in lookup_table:\n",
    "                    if '=' in event_id:\n",
    "                        # results in \"1=123\"\n",
    "                        event_ids_mapped.append('event' + lookup_table[event_id.split('=')[0]] + '=' + event_id.split('=')[1])\n",
    "                    else:\n",
    "                        # results in \"1\"\n",
    "                        event_ids_mapped.append('event' + lookup_table[event_id])\n",
    "            # join all events into a string separated by comma\n",
    "            return ','.join(event_ids_mapped)\n",
    "        else:\n",
    "            return None\n",
    "\n",
    "    df['event_list'] = df['event_list'].apply(map_event_ids)\n",
    "\n",
    "    return df\n",
    "\n",
    "def renameColumns(df, export_columns, export_columns_iterative, keep_iterative_columns):\n",
    "\n",
    "    # switch key and value in export_columns\n",
    "    rename_columns = {export_columns[column]: column for column in export_columns if export_columns[column] is not None and not isinstance(export_columns[column], list)}\n",
    "\n",
    "    # first drop columns that are not required\n",
    "    for column in df.columns:\n",
    "        # all columns that are not in our rename list\n",
    "        # and all columns that are not \"iterative\" columns like prop1, prop2, ...\n",
    "        # and all columns that are in our export list but have no rename value\n",
    "        if (column not in rename_columns and column[:4] not in keep_iterative_columns) or \\\n",
    "        (column in export_columns and export_columns[column] is None) or \\\n",
    "        column in always_drop:\n",
    "\n",
    "            # BUT(!) do keep hit event type columns, we need them later\n",
    "            if column not in hit_event_type_columns:\n",
    "    \n",
    "                df = df.drop(columns=column)\n",
    "\n",
    "    # rename remaining columns\n",
    "    df = df.rename(columns=rename_columns)\n",
    "    \n",
    "    # rename iterative columns (props, evars, ...)\n",
    "    for column in export_columns_iterative:\n",
    "\n",
    "        for subcolumn in export_columns_iterative[column]:\n",
    "\n",
    "            for i in range(export_columns_iterative[column][subcolumn][0], export_columns_iterative[column][subcolumn][1] + 1):\n",
    "\n",
    "                df = df.rename(columns={f\"{column}{i}\": f\"{subcolumn}{i}\"})\n",
    "\n",
    "    return df\n",
    "\n",
    "def mapPageEvents(df):\n",
    "\n",
    "    # https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-page-event.html?lang=en\n",
    "    # we probably only need those events\n",
    "    #   page_event = \n",
    "    #       0: page view\n",
    "    #       10: Custom links\n",
    "    #       11: Download links\n",
    "    #       12: Exit links\n",
    "    #       70: Hit includes Target activity data\n",
    "    # if not 0 or 70, linkName needs to be provided\n",
    "    # if linkName is not empty, linkType needs to be provided\n",
    "    # linkType = d, e, o\n",
    "    # \n",
    "\n",
    "    # DataFeed                              BDIA\n",
    "    # page_event [0,10,11,12]               linkType [d,e,o]\n",
    "    # page_event_var1 for Download Links    Link URL\n",
    "    # page_event_var2 for Custom Links      Link Name\n",
    "\n",
    "    page_events = {'10': 'o', '11': 'd', '12': 'e'}\n",
    " \n",
    "    df['linkType'] = df['linkType'].map(page_events)\n",
    "\n",
    "    return df\n",
    "\n",
    "def searchAllColumnsForAValue(df, value):\n",
    "\n",
    "    \"\"\"\n",
    "    search all columns for a value\n",
    "    return a list of columns that contain the value\n",
    "    \"\"\"\n",
    "    \n",
    "    columns = []\n",
    "\n",
    "    for column in df.columns:\n",
    "        if df[column].str.contains(value).any():\n",
    "            columns.append(column)\n",
    "\n",
    "    return columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parameters\n",
    "\n",
    "The `exportcolumns` dict maps the column  names from the DF to the BDIA. If you set a column to `None` it will not be taken over. Two important things to mention:\n",
    "\n",
    "* the import file **must** contain a timestamp column!\n",
    "* the import file **must** contain some kind of user identification, either the MC Visitor ID, IP-Address, the visitorID or a self-managed customer ID\n",
    "\n",
    "Not all columns are mapped. This is a *minimal* working setup. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "export_columns = {\n",
    "    # The timestamp column is _mandatory_ for the BDIA\n",
    "    'timestamp':                'hit_time_gmt', # The timestamp of the hit Adobe data collection servers received the hit, based in Unix time\n",
    "\n",
    "    # Every hit must provide some kind of user identifier, we usually work with the Marketing Cloud Visitor ID\n",
    "    'marketingCloudVisitorID':  'mcvisid', # Experience Cloud Visitor ID. 128-bit number consisting of two concatenated 64-bit numbers padded to 19 digits.\t\n",
    "    'ipaddress':                'ip', # The IPv4 address, based on the HTTP header of the image request. Mutually exclusive to ipv6; if this column contains a non-obfuscated IP address, ipv6 is blank.\t\n",
    "    'visitorID':                None, # only one id-attribute is required, either marketingCloudVisitorID, visitorID, customerID or ip address\n",
    "    'customerID.[customerIDType].id':           None,\n",
    "    'customerID.[customerIDType].authState':    None,\n",
    "    'customerID.[customerIDType].isMCSeed':     None,\n",
    "\n",
    "    # technical and browser information\n",
    "    'hints.platform':           'os', # TODO: does this actually fit?\n",
    "    'connectionType':           'connection_type',\n",
    "    'userAgent':                'user_agent',\n",
    "    'browserHeight':            'browser_height',\n",
    "    'browserWidth':             'browser_width',\n",
    "    'colorDepth':               'color', # map to color_depth.tsv\n",
    "    'resolution':               'resolution',\n",
    "    'javaEnabled':              'java_enabled',\n",
    "    'cookiesEnabled':           'cookies',\n",
    "    'currencyCode':             'currency',\n",
    "    'zip':                      'geo_zip',\n",
    "\n",
    "    # web site information\n",
    "    'server':                   'user_server',\n",
    "    'pageName':                 'pagename',\n",
    "    'pageURL':                  'page_url',\n",
    "    'pageType':                 'page_type',\n",
    "    'referrer':                 'referrer',\n",
    "    'campaign':                 'campaign',\n",
    "    'channel':                  'channel',\n",
    "    'language':                 'language',\n",
    "    'purchaseID':               'purchaseid',\n",
    "\n",
    "    # custom events\n",
    "    'events':                   'event_list', # map to event.tsv, contains custom events from event1 to event1000\n",
    "    \n",
    "    # hit type (Page View or Click)\n",
    "    'linkType':                 'page_event', # page event helps identify Page Views (0) or clicks (10, 11, 12) or other Adobe events, see https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-page-event.html?lang=en\n",
    "    'linkName':                 'page_event_var2', # The custom name (if specified) of the link\n",
    "    'linkURL':                  'page_event_var1', # The URL of the download link, exit link, or custom link clicked\n",
    "    \n",
    "    # most columns are not required and therefore omitted, if you use Adobe Target you may consider using the TNTA column\n",
    "    'aamlh':                    None, # Adobe Audience Manager location hint, probably not required\n",
    "    'tnta':                     None,\n",
    "    'transactionID':            None,\n",
    "    'hints.platformversion':    None,\n",
    "    'hints.wow64':              None,\n",
    "    'products':                 None,\n",
    "    'queryString':              None,\n",
    "    'hints.architecture':       None,\n",
    "    'hints.bitness':            None,\n",
    "    'hints.brands':             None,\n",
    "    'hints.mobile':             None,\n",
    "    'hints.model':              None,\n",
    "    'trackingServer':           None,\n",
    "    'contextData.key':          None\n",
    "}\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Adobe does not provide the exact same column names for \"iterative columns\" (Props, eVars, lists and hierarchies). We need to pre-process them. I will explain the process in detail later. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "keep_iterative_columns = ['prop', 'mvva', 'hier', 'evar'] # when renaming columns, skip those, because they need special treatment\n",
    "\n",
    "export_columns_iterative = {\n",
    "    'prop': {'prop': [1, 75]},\n",
    "    'mvvar': {'list': [1, 3]},\n",
    "    'hier': {'hier': [1, 5]},\n",
    "    'evar': {'eVar': [1, 250]}\n",
    "}\n",
    "\n",
    "hit_event_type_columns = ['page_event', 'page_event_var1', 'page_event_var2']\n",
    "always_drop = ['mvvar1_instances', 'mvvar2_instances', 'mvvar3_instances'] # if you know that you need those columns, empty this list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Adobe's Data Feed provides \"relational data\", meaning: Some columns contain IDs that can be mapped to actual, readable values. Depending on your requirements, you can comment-out the lookups that you do not need. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lookup_tables = {\n",
    "    # 'browser':                'browser.tsv', # not in the list above\n",
    "    'color':                    'color_depth.tsv',\n",
    "    'connection_type':          'connection_type.tsv',\n",
    "    'language':                 'languages.tsv',\n",
    "    'os':                       'operating_systems.tsv',\n",
    "    'resolution':               'resolution.tsv',\n",
    "    # 'browser_type':           'browser_type.tsv', # no column in source fits this?!\n",
    "    # 'event_list':             'event.tsv',\n",
    "    # 'country.tsv':            'country'\n",
    "    # 'ref_type':               'referrer_type.tsv' # probably not required\n",
    "    # 'search_engines.tsv':     'search_engines' # probably not required\n",
    "    # 'plugins.tsv':            'plugins', # probably not required\n",
    "    # 'javascript_version.tsv': 'javascript_version', # probably not required\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Processing Data\n",
    "## Loading and pre-processing\n",
    "This part combines multiple steps in a loop for all given dates:\n",
    "* load data from the tab separated source files\n",
    "* remove \"invalid\" rows\n",
    "* apply lookup tables\n",
    "* map event IDs: \n",
    "    * because the data feed also applied a lookup table to events, we'll have to map those columns back\n",
    "    * the lookup table is located in `events.tsv` and contains some defaults events, instances of eVars and custom events, **please note**: we stick with the latter ones only!\n",
    "    * every row may contain more than one event, we also take care of currency events, like `event42=23`\n",
    "* the data should already be in order, nevertheless, make that sort order by `hit_time_gmt` is present, because BDIA requires it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "source_report_suite_name = \"source\"\n",
    "target_report_suite_name = \"target\"\n",
    "\n",
    "days = ['2024-01-01', '2024-01-01']\n",
    "encoding = \"utf-8\"\n",
    "num_rows = 99999999\n",
    "df_raw = {}\n",
    "df = {}\n",
    "\n",
    "for day in days:\n",
    "    print(f'loading {day}')\n",
    "    df_raw[day] = loadDataFeed(day, num_rows=num_rows, report_suite_name=source_report_suite_name, encoding=encoding)\n",
    "    \n",
    "    df_raw[day] = df_raw[day].dropna(subset=['hit_time_gmt']) # this column should not be empty, if it is, the row is not useful for us\n",
    "\n",
    "    df[day] = applyLookUpTables(df=df_raw[day].copy(), report_suite_name=source_report_suite_name, lookup_tables=lookup_tables, day=day, encoding=encoding)\n",
    "    \n",
    "    df[day] = mapEventIds(df=df[day], report_suite_name=source_report_suite_name, day=day, encoding=encoding)\n",
    "    \n",
    "    df[day] = df[day].sort_values('hit_time_gmt').reset_index(drop=True)\n",
    "    \n",
    "    print(f'\\tloaded {len(df[day])} rows')\n",
    "    \n",
    "# Write the Pickle\n",
    "df = pd.concat(df.values(), axis=0)\n",
    "df.to_pickle('df.pkl')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clean up\n",
    "Now we are going to clean up the data by first removing empty rows and rows that Adobe would ignore anyways. (If you want to remove other \"invalid rows\", see section below named \"Find invalid hits\".)\n",
    "\n",
    "Now we also have the chance to remove other unwanted content based on a search criteria, in this case every hit recored on a particular host:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.dropna(axis=1, how='all')\n",
    "df = df.loc[df['exclude_hit'] == '11']\n",
    "# df = df.loc[df['user_server'] == 'example.com']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Column renaming\n",
    "For a deeper understanding you should refer to the inline-comments of the function `renameColumns`. This step, apparently, renames the columns according to our definitions from above. It also takes care of our \"iterative columns\" like props, Evars and so on. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_export = renameColumns(df=df.copy(), \n",
    "                   export_columns=export_columns, \n",
    "                   export_columns_iterative=export_columns_iterative, \n",
    "                   keep_iterative_columns=keep_iterative_columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Map Events\n",
    "The term \"event\" may be misleading: This does not cover success events, but so called *page events* ([see doc](https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-page-event.html?lang=en)). Adobe has an ID mapping for them: \n",
    "- 0: Page View\n",
    "- 10: Custom links\n",
    "- 11: Download links\n",
    "- 12: Exit links\n",
    "- 70: Hit includes Target activity data\n",
    "\n",
    "Events from 10 to 12 always come with a value in `page_event_var2` aka `linkName` and if `page_event_var2` is **not** empty, `linkType` aka `page_event` needs to be provided. And `linkType` can be:\n",
    "- d for download\n",
    "- e for exit\n",
    "- o for other link\n",
    "\n",
    "In short (excl. Target):\n",
    "\n",
    "        DataFeed                    BDIA\n",
    "        - page_event [0,10,11,12]   linkType [d,e,o]\n",
    "        - page_event_var1           linkURL\n",
    "        - page_event_var2           linkName\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_export = mapPageEvents(df=df_export)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Export\n",
    "Before we write the data to disk, as above mentioned, we will fill our new eVar with a flag to identify the data. We also **must** provide a report suide name/ID for the import.\n",
    "\n",
    "Finally we will write three files:\n",
    "- a Pickle file for later reference and data validation\n",
    "- a file with a part of the full data set to validate it against the BDIA validate endpint\n",
    "- a file with all the data we want to import\n",
    "\n",
    "The CSV files will be compressed. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_export['prop42'] = 'imported-at-2024-01-01'\n",
    "df_export['reportSuiteID'] = f'{target_report_suite_name}'\n",
    "\n",
    "# Write the Pickle\n",
    "df_export.to_pickle('df_export.pkl')\n",
    "\n",
    "# Write the full data set to disk\n",
    "df_export.to_csv(f'output/{target_report_suite_name}.csv', index=False)\n",
    "with open(f'output/{target_report_suite_name}.csv', 'rb') as f_in:\n",
    "    with gzip.open(f'output/{target_report_suite_name}.csv.gz', 'wb') as f_out:\n",
    "        shutil.copyfileobj(f_in, f_out) \n",
    "        \n",
    "# write sampled data to disk\n",
    "df_export_sampled = df_export.sample(1000).copy()\n",
    "df_export_sampled.to_csv(f'output/{target_report_suite_name}_sample.csv', index=False)\n",
    "with open(f'output/{target_report_suite_name}_sample.csv', 'rb') as f_in:\n",
    "    with gzip.open(f'output/{target_report_suite_name}_sample.csv.gz', 'wb') as f_out:\n",
    "        shutil.copyfileobj(f_in, f_out) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## (Encoding Troulbeshooting)\n",
    "If you have problems with the encoding, there are at least three ways to dealing with that:\n",
    "- detect existing encoding\n",
    "- skip when read_csv encounters encoding error\n",
    "- use a different engine for read_csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import chardet\n",
    "encoding = chardet.detect(df)['encoding']\n",
    "encoding    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_csv(f\"source.tsv\",\n",
    "    encoding_errors='ignore',\n",
    "    engine=\"c\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Validate Data\n",
    "This is an important step as it helps you to find inconsistencies. The following steps kind of re-produce Adobe's aggregation process and therefore is usefull to compare sums for different metrics/dimensions. \n",
    "\n",
    "Please not how we load two different data frames here, the latter one is \"import-ready\". We can compare both against each other **and** against the numbers from a Adobe Analytics dashboard."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_pickle('df.pkl')\n",
    "print(f'Loaded {len(df)} raw-like rows')\n",
    "\n",
    "df_export = pd.read_pickle('df_export.pkl')\n",
    "print(f'Loaded {len(df_export)} fully processed rows')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Look up a value"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# simple\n",
    "df.loc[df['prop42'] == 'the answer']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# a little more room\n",
    "df.loc[df['prop42'].str.contains('the', na=False)]['prop42']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# over all columns\n",
    "searchAllColumnsForAValue(df, 'kaufm')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Find invalid hits\n",
    "The data feed contains invalid hits, which are marked as \"invalid\", they will be ignored in the import. See [data feed column documentation](https://experienceleague.adobe.com/docs/analytics/export/analytics-data-feed/data-feed-contents/datafeeds-reference.html?lang=en) for details. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('exclude_hit')['exclude_hit'].count().sort_values(ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Group by column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('user_server')['user_server'].count().sort_values(ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Excluded hits statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('exclude_hit')['exclude_hit'].count().sort_values(ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Page Events (Page Views and Clicks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('page_event')['page_event'].count().sort_values(ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Count final page evens Page Views and Clicks)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "page_views = df_export['linkType'].isna().sum()\n",
    "\n",
    "print(f'page views: {page_views}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "exit_links = df_export.loc[df_export['linkType'] == 'e']['linkType'].count()\n",
    "other_links = df_export.loc[df_export['linkType'] == 'o']['linkType'].count()\n",
    "download_links = df_export.loc[df_export['linkType'] == 'd']['linkType'].count()\n",
    "\n",
    "print(f'instances of exit links: {exit_links}, other links: {other_links}, download links: {download_links}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understand different timestamp columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[['hit_time_gmt', 'last_hit_time_gmt', 'cust_hit_time_gmt', 'date_time', 't_time_info', 'visid_timestamp']].head(1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Investigate different zip columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[['zip', 'geo_zip']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check lookup table mapping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df['resolution'].head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Count values for a given dimension"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.groupby('prop42')['prop42'].count().sort_values(ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Count unique visitors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "unique_visitors = df_export['marketingCloudVisitorID'].groupby(by=df_export['marketingCloudVisitorID']).count()\n",
    "\n",
    "print(f'unique visitors: {len(unique_visitors)}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analyze success events"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "events = df_export['events'].groupby(by=df_export['events']).count().head(6).sort_values(ascending=False)\n",
    "print(f'top 5 events:')\n",
    "events"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A little more sophisticated including page names..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "events = ['5', '23', '42']\n",
    "for event in events:\n",
    "\n",
    "    df_event = df_export[df_export['events'].str.contains(event)]\n",
    "    print(f'{len(df_event)} instances of event {event}')\n",
    "    df_eventByPageName = df_event['pageName'].groupby(by=df_event['pageName']).count().sort_values(ascending=False)\n",
    "    print(df_eventByPageName)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lookup a particular success event"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "event_registration = df_export.loc[df_export['events'].str.contains('event42')]['events'].count()\n",
    "\n",
    "print(f'an event (e42): {event_registration}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analyze page names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "page_names = df_export.groupby('pageName')['pageName'].count().sort_values(ascending=False)\n",
    "print(f'{len(page_names)} different page names')\n",
    "print(page_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analyze campaign information"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "campaigns = df_export.groupby('campaign')['campaign'].count().sort_values(ascending=False)\n",
    "print(f'{len(campaigns)} instances of evar0 (campaign)')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}