Implementation details

Dependencies

Faker - pip install Faker

Implementation Details

Programming language used:

Python

Files:
.
├── analyse_data.py
├── data
│   ├── airing.json
│   ├── channel.json
│   ├── program.json
│   ├── viewer.json
│   └── viewership.json
├── gen_data.py
├── helper
│   ├── Constant.py
│   ├── __init__.py
│   ├── IO.py
│   └── Model.py
└── __init__.py

  • python gen_data.py generates the test database in the data/ directory.
  • python analyse_data.py runs the analysis on the generated database.
  • The helper module contains IO.py and Model.py, which provide the disk read/write operations for the JSON files and the blueprint for the database schema, respectively (a minimal sketch of the IO helpers follows this list).
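As a rough illustration, the JSON read/write helpers in IO.py could look something like the following; the function names read_json and write_json are assumptions for this sketch, not the actual API of the gist:

```python
import json


def read_json(path):
    """Load a JSON file from disk and return the parsed object."""
    with open(path) as f:
        return json.load(f)


def write_json(path, data):
    """Serialize `data` as JSON and write it to `path`."""
    with open(path, "w") as f:
        json.dump(data, f, indent=2)
```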
Logic:
Part 1: Generate test database.
  • Faker is used to generate distinct test viewer names. A unique vwr_id is generated with Python's uuid module (see the sketch after this list).

  • For each channel, an Airing is generated by randomly selecting programs. This is carried out over a month, so a random (channel, program) combination is created for each time interval.

  • To generate the test JSON files (the database), a Model class (helper/Model.py) is created for each table.
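A minimal sketch of the generation step described above; the field names (vwr_id, vwr_name, chn_id, pgm_id, air_dt) and helper names are illustrative assumptions, not the actual schema used in gen_data.py:

```python
import random
import uuid

from faker import Faker

fake = Faker()


def gen_viewers(n):
    """Generate `n` test viewers with unique ids and fake names."""
    return [{"vwr_id": str(uuid.uuid4()), "vwr_name": fake.name()} for _ in range(n)]


def gen_airings(channels, programs, slots):
    """For each channel, pick a random program for every time slot in `slots`."""
    return [
        {"chn_id": chn["chn_id"],
         "pgm_id": random.choice(programs)["pgm_id"],
         "air_dt": slot}
        for chn in channels
        for slot in slots
    ]
```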

Part 2: Process the test database.

Most of this part is simple dictionary processing. Some key points are:

  • The vship_data (Viewership table) is sorted by view_dt (view date). Next, the channels are filtered from the sorted data for the given time interval using binary search (a sketch follows this list).

  • _get_vship_with_duration: the filtered channels are used to look up the corresponding program details in air_data (Airing table). This information is stored in a map chn_prog_map (channel_id -> (program_id, duration)). The function also returns the viewer count for each channel, as obtained from the viewership table.

  • We also build maps chn_id -> chn_name and pgm_id -> (pgm_genres, pgm_title) for efficient access.

  • To handle prime time, the same search over the viewership table (first point) is applied to time components smaller than a day (i.e. hours, minutes, etc.).
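A minimal sketch of the interval filtering and chn_prog_map construction described above, assuming record fields view_dt, chn_id, pgm_id, and duration (the field and helper names are illustrative). The prime-time case applies the same filtering with interval boundaries expressed at hour granularity:

```python
from bisect import bisect_left, bisect_right


def filter_by_interval(vship_data, start, end):
    """Return viewership records with start <= view_dt <= end.

    `vship_data` must already be sorted by view_dt, so the interval
    boundaries can be located with binary search.
    """
    keys = [rec["view_dt"] for rec in vship_data]
    return vship_data[bisect_left(keys, start):bisect_right(keys, end)]


def build_chn_prog_map(filtered, air_data):
    """Map chn_id -> (pgm_id, duration) and count viewers per channel."""
    airing_by_chn = {a["chn_id"]: (a["pgm_id"], a["duration"]) for a in air_data}
    chn_prog_map, viewer_count = {}, {}
    for rec in filtered:
        chn = rec["chn_id"]
        if chn in airing_by_chn:
            chn_prog_map[chn] = airing_by_chn[chn]
            viewer_count[chn] = viewer_count.get(chn, 0) + 1
    return chn_prog_map, viewer_count
```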

New insights (proposal)
  1. Analyse the viewership pattern for each viewer, i.e. find the most popular genre a viewer is interested in during a given time of day. This analysis can offer insight into the viewer's mood over the course of a day, and knowing a viewer's mood at a given time is crucial information.
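A minimal sketch of how such an analysis could work, assuming the viewership records have already been joined to program genres (all names below are illustrative):

```python
from collections import Counter, defaultdict


def favourite_genre_by_hour(joined_records):
    """joined_records: iterable of (vwr_id, hour, genre) tuples.

    Returns {(vwr_id, hour): genre}, i.e. the most-watched genre for each
    viewer at each hour of the day.
    """
    counts = defaultdict(Counter)
    for vwr_id, hour, genre in joined_records:
        counts[(vwr_id, hour)][genre] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}
```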