
Outbrain Data Exploration Report

Understand the data

  • scale: find out how many rows each file has and how much disk space it takes.
  • primary_key: identify the primary objects, such as document, advertisement, event, user, etc.
  • cardinality: find the number of unique values for each data point (see the sketch after this list).
  • relationship: find the relationship between each pair of data points; it can be one-to-one, one-to-many, or many-to-many.
  • data quality: find out how many values are missing.
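
The script in the next section covers the first point (scale) plus a schema preview. For the cardinality and data quality points, a minimal sketch with plain pandas (the file path is hypothetical, and only a sample of rows is read) could look like this:

import pandas as pd

def column_profile(csv_path, nrows=100000):
    """Sketch: per-column cardinality and missing-value counts.

    Only the first ``nrows`` rows are read, so the numbers are estimates
    rather than exact statistics over the whole file.
    """
    df = pd.read_csv(csv_path, nrows=nrows)
    return pd.DataFrame({
        "cardinality": df.nunique(),        # unique values per column
        "n_missing": df.isnull().sum(),     # missing cells per column
        "pct_missing": df.isnull().mean(),  # fraction of cells missing
    })

# hypothetical usage:
# print(column_profile("raw_data/events.csv"))
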

We simply read the first 1000 rows of each csv file, count its total number of lines, and write everything into a single Excel file. Now browsing the data is easy.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
from pathlib_mate import Path
from sfm import lines_count
from outbrain import config
from outbrain.logger import logger

def merge_everything_to_schema():
    """

    **中文文档**

    汇总所有csv文件的Schema。

    每个文件读取前1000行,以及统计总行数。然后将多个文件的信息汇总到一个
    Excel表中, 以供方便观察Schema。
    """
    logger.info("Run merge_everything_to_schema()")
    writer = pd.ExcelWriter("Schema.xlsx")
    # select all .csv files
    for p in Path(config.data_dir).select_by_ext(".csv"):
        logger.info("process %s ..." % p.basename, 1)
        lines = lines_count.count_lines(p.abspath) - 1  # data rows, excluding the header
        sheet_name = p.fname + "_%s" % lines  # encode the row count in the sheet name
        df = pd.read_csv(p.abspath, nrows=1000)  # fetch the first 1000 rows
        df.to_excel(writer, sheet_name, index=False)  # one spreadsheet per file

    writer.save() # write to excel

merge_everything_to_schema()

Explore the data

We noticed that document_id is associated with ad_id and click status via display_id in events.csv, and each document has category, topic, entity, and other metadata.
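
Concretely, that link chain can be materialized with one join. This is a sketch, assuming the standard Outbrain column names and reusing the imports from the script above:

# Sketch: attach each ad impression to the page's document_id via display_id.
events = pd.read_csv(Path(config.data_dir, "events.csv").abspath,
                     usecols=["display_id", "document_id"])
clicks = pd.read_csv(Path(config.data_dir, "clicks_train.csv").abspath,
                     usecols=["display_id", "ad_id", "clicked"])
merged = clicks.merge(events, on="display_id", how="left")
print(merged.head())  # display_id, ad_id, clicked, document_id
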

We believe these relationships are very important, and we also want to take advantage of a database to speed up data selection. Here is some code to find things out.

def investigate_documents_categories_csv():
    """Find out the document_id vs category_id is many-to-many.

    **中文文档**

    检查document_id和category_id的关系:

    结论: 多对多的关系。
    """
    p = Path("raw_data", "documents_categories.csv")
    df = pd.read_csv(p.abspath, usecols=["document_id", "category_id"])
    n_rows = df.shape[0]
    n_document_id = len(df["document_id"].unique())
    n_category_id = len(df["category_id"].unique())
    print(n_rows, n_document_id, n_category_id) # 5481475 2828649 97

investigate_documents_categories_csv()
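
The three counts already imply the answer: there are far more rows than unique document_id values, and only 97 categories. The direction can also be double-checked explicitly with a groupby. A minimal sketch, reusing pd and Path from the imports above:

def check_relationship_direction():
    """Sketch: confirm that document_id vs category_id is many-to-many.

    If some document maps to several categories AND some category maps to
    several documents, the relationship is many-to-many.
    """
    p = Path("raw_data", "documents_categories.csv")
    df = pd.read_csv(p.abspath, usecols=["document_id", "category_id"])
    max_cat_per_doc = df.groupby("document_id")["category_id"].nunique().max()
    max_doc_per_cat = df.groupby("category_id")["document_id"].nunique().max()
    print(max_cat_per_doc > 1, max_doc_per_cat > 1)  # expect: True True

check_relationship_direction()
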

ad_id exists in clicks_train.csv, clicks_test.csv, and promoted_content.csv, and promoted_content.csv provides rich metadata. Is that metadata available for every ad_id in the train and test data? The script tells me "YES"!

def investigate_ad_id():
    """

    **中文文档**

    由于在clicks_train, clicks_test和promoted_content三个文件中都有ad_id。
    promoted_content中的ad_id是unique的, 而另外两个文件中的不是。所以我们要检查
    clicks_train, clicks_test两个文件中的ad_id的集合是否有promoted_content中
    没有的元素。

    结论: promoted_content中的ad_id就是所有的ad_id。
    """
    p = Path(config.data_dir, "clicks_train.csv")
    df_train = pd.read_csv(p.abspath, usecols=[1,])

    p = Path(config.data_dir, "clicks_test.csv")
    df_test = pd.read_csv(p.abspath, usecols=[1,])

    p = Path(config.data_dir, "promoted_content.csv")
    df_promoted = pd.read_csv(p.abspath, usecols=[0,])

    s = (set(df_train["ad_id"]) | set(df_test["ad_id"])) - set(df_promoted["ad_id"])

    print(len(s)) # 0

investigate_ad_id()
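
For reference, the same coverage test can be expressed with pandas set logic, selecting the columns by name instead of by position. A sketch over the same three files:

def check_ad_id_coverage():
    """Sketch: count ad_id values in train/test that promoted_content lacks."""
    train = pd.read_csv(Path(config.data_dir, "clicks_train.csv").abspath, usecols=["ad_id"])
    test = pd.read_csv(Path(config.data_dir, "clicks_test.csv").abspath, usecols=["ad_id"])
    promoted = pd.read_csv(Path(config.data_dir, "promoted_content.csv").abspath, usecols=["ad_id"])

    all_ad_id = pd.concat([train["ad_id"], test["ad_id"]]).drop_duplicates()
    print((~all_ad_id.isin(promoted["ad_id"])).sum())  # expect: 0

check_ad_id_coverage()
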

Baseline Model

clicks_train.csv provides a naive probability that each ad_id will be clicked. So for any display_id, how about ranking its candidate ad_id by that probability? Let's have a try:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd
from pathlib_mate import Path
from sfm.timer import Timer
from sfm.pandas_mate import IOTool
from outbrain import config
from outbrain.logger import logger


def naive_guess_based_on_probability():
    """

    Runtime 5 minutes.
    """
    table = dict()
    chunksize = 1000000

    with Timer():
        logger.info("calculate probability ...")
        p = Path(config.data_dir, "clicks_train.csv")
        for _, ad_id, click in IOTool.csv_tuple_iterator(p.abspath, chunksize):
            if click:
                try:
                    table[ad_id]["all"] += 1
                    table[ad_id]["clicked"] += 1
                except KeyError:  # first time we see this ad_id
                    table[ad_id] = {"all": 1, "clicked": 1}
            else:
                try:
                    table[ad_id]["all"] += 1
                except KeyError:  # first time we see this ad_id
                    table[ad_id] = {"all": 1, "clicked": 0}

    for _, d in table.items():
        d["prob"] = d["clicked"] * 1.0 / d["all"]

    with Timer():
        logger.info("group by display_id ...")
        display = dict()
        p = Path(config.data_dir, "clicks_test.csv")
        for display_id, ad_id in IOTool.csv_tuple_iterator(p.abspath, chunksize):
            try:
                display[display_id].append(ad_id)
            except KeyError:  # first time we see this display_id
                display[display_id] = [ad_id,]

    with Timer():
        logger.info("choose best ad_id ...")
        result = list()
        for display_id, ad_id_list in display.items():
            prob_table = [(ad_id, table.get(ad_id, {"prob": 0.0})["prob"]) for ad_id in ad_id_list]
            best_ad_id, best_prob = max(prob_table, key=lambda x: x[1])
            result.append((
                display_id, 
                best_ad_id, 
                best_prob, 
                ", ".join([str(ad_id) for ad_id in ad_id_list])
            ))

        df = pd.DataFrame(result, columns=["display_id", "best_ad_id", "best_prob", "all_ad_id"])

    with Timer():
        logger.info("export result ...")
        df.to_csv("result.csv", index=False)

naive_guess_based_on_probability()
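
As an aside, the probability table that the first loop builds by hand is equivalent to a single pandas groupby over clicks_train.csv. A sketch, assuming its columns are display_id, ad_id, clicked; the chunked dict approach above trades this brevity for bounded memory:

def click_probability_table():
    """Sketch: per-ad click probability = clicked / all impressions."""
    p = Path(config.data_dir, "clicks_train.csv")
    df = pd.read_csv(p.abspath, usecols=["ad_id", "clicked"])
    return df.groupby("ad_id")["clicked"].mean()  # Series: ad_id -> probability
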