mcritchlow/modeling.md

## modeling.md

      
    Summary

The current model pattern for Google Analytics statistics support in Hyrax essentially follows this database schema:
  create_table "work_view_stats", force: :cascade do |t|
    t.datetime "date"
    t.integer "work_views"
    t.string "work_id"
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
    t.integer "user_id"
    t.index ["user_id"], name: "index_work_view_stats_on_user_id"
    t.index ["work_id"], name: "index_work_view_stats_on_work_id"
  end

In this format, each row stores a single metric from Google Analytics, in this case work_views. The other columns that identify a row are a date, a work_id (or file_id) and a user_id.
Currently, there are three existing analytics tables: work_view_stats, file_view_stats and file_download_stats. This implies a pattern of "one database table per metric". The existing Hyrax::Statistic API assumes this. For example, several statistics presenters call directly into the to_flot method, assuming it returns a single array containing one date and one metric. This assumption cascades through the Hyrax::Statistic class.
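
To make the single-metric assumption concrete, here is a minimal sketch in plain Ruby. WorkViewRow is a hypothetical stand-in for one row of work_view_stats, not Hyrax's actual model, and the millisecond conversion is an assumption about what a flot-style chart expects.

```ruby
# Hypothetical stand-in for one row of work_view_stats (not the Hyrax model).
WorkViewRow = Struct.new(:date, :work_views, :work_id, :user_id) do
  # Returns a single [timestamp_in_ms, value] pair: one date, one metric.
  def to_flot
    [date.to_i * 1000, work_views]
  end
end

rows = [
  WorkViewRow.new(Time.utc(2018, 2, 25), 3, "123", 1),
  WorkViewRow.new(Time.utc(2018, 2, 26), 5, "123", 1)
]

# A presenter would collect one pair per row to build a chart series.
series = rows.map(&:to_flot)
```

Because each pair has room for only one value, any presenter built on this shape can chart only the one metric its table stores.
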
In the current Analytics Sprint we need to track additional metrics, including:

Returning Visitors
Unique Visitors
Site-wide Unique Visitors
Site-wide Returning Visitors
Visibility
others?

The group also needs to support multiple remote analytics backends; for now, that specifically means Matomo.
This leaves our group needing to make a decision about the pattern for modeling these metrics. It seems we have (at least) three options:

Continue the existing pattern of "one table per metric"
Create new tables that are more inclusive. For example, a WorkStat table would include views, unique visitors, returning visitors, possibly visibility, etc.
Update existing tables to support the new attributes. So WorkViewStat might have the new attributes added to it, be renamed via a migration, etc.

Some thoughts on each below:
One table per metric

Pros:

Follows existing pattern and as a result, most of the backend code would continue to work as-is
New presenters created could leverage the same query patterns as existing presenters

Cons:

Seems wasteful from a database modeling perspective. Couple that with the existing rows with 0 entries, and it's adding up to a pretty large DB footprint over time.
Creates the need for several queries to be able to respond to a presenter saying "Give me all the statistics for Work 123 on February 26, 2018."
Site-wide metrics don't fit into this pattern where a user_id and work_id or page_id are used as primary attributes uniquely identifying a table row.
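
The "several queries" concern can be sketched in plain Ruby. The hashes below are hypothetical in-memory stand-ins for per-metric tables (only a work views table exists today; the other two are assumed for illustration), and stats_for is an illustrative helper, not part of Hyrax.

```ruby
# Hypothetical stand-ins for three per-metric tables, keyed by [work_id, date].
WORK_VIEWS         = { ["123", "2018-02-26"] => 5 }.freeze
UNIQUE_VISITORS    = { ["123", "2018-02-26"] => 4 }.freeze
RETURNING_VISITORS = { ["123", "2018-02-26"] => 1 }.freeze

METRIC_TABLES = {
  work_views: WORK_VIEWS,
  unique_visitors: UNIQUE_VISITORS,
  returning_visitors: RETURNING_VISITORS
}.freeze

# Gathers every metric for one work on one date. Each table needs its own
# lookup, so the count grows with the number of metrics tracked.
def stats_for(work_id, date)
  lookups = 0
  stats = METRIC_TABLES.each_with_object({}) do |(name, table), result|
    lookups += 1
    result[name] = table.fetch([work_id, date], 0)
  end
  [stats, lookups]
end

stats, lookups = stats_for("123", "2018-02-26")
```

With three metrics the helper performs three lookups; in a real database each lookup would be a separate query against a separate table.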

Brand new tables

Pros:

Allows for a potentially more ideal database model that more accurately maps to the needs of the front end query system
Should be more performant (how much is unknown)
Would allow modeling site-wide metrics differently (just using date as unique id/filter)
Would allow new remote caching code to be completely segregated from the existing Hyrax::Statistic class subclasses (could probably be a single delegation which we already have a hook in place for).
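
As a rough illustration of what the new tables might hold, the sketch below uses plain Ruby structs. WorkStat comes from the option described above; SiteStat and all column names are assumptions made for this example, not a proposed schema.

```ruby
require "date"

# Hypothetical combined row: every work-level metric lives on one record.
WorkStat = Struct.new(:date, :work_id, :user_id, :views,
                      :unique_visitors, :returning_visitors,
                      keyword_init: true)

# Hypothetical site-wide row: identified by date alone, no work_id or user_id.
SiteStat = Struct.new(:date, :unique_visitors, :returning_visitors,
                      keyword_init: true)

day = Date.new(2018, 2, 26)

work_stats = [
  WorkStat.new(date: day, work_id: "123", user_id: 1,
               views: 5, unique_visitors: 4, returning_visitors: 1)
]
site_stats = [
  SiteStat.new(date: day, unique_visitors: 40, returning_visitors: 12)
]

# One lookup returns all metrics for Work 123 on the given day.
work_row = work_stats.find { |s| s.work_id == "123" && s.date == day }

# Site-wide metrics filter on date only.
site_row = site_stats.find { |s| s.date == day }
```

The trade-off is that a single row now carries several values, so any caller expecting one metric per row has to say which one it wants.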

Cons:

Would require existing Hyrax users to completely rebuild their local "cache", at least eventually to be able to utilize the new reporting dashboard(s) and metrics.
Possibly a complicated migration path for users; it would require thorough testing and very well-documented migration instructions.
The Hyrax::Statistic code will need to change to not rely on a single column for data, such as the to_flot execution path
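
One possible direction for that change, sketched in plain Ruby: let the caller name the metric it wants. MultiMetricRow, FLOT_METRICS and the metric argument are all hypothetical; this is not how Hyrax::Statistic works today.

```ruby
# Hypothetical list of columns that may be charted.
FLOT_METRICS = %i[views unique_visitors].freeze

# Hypothetical multi-metric row (not a Hyrax class).
MultiMetricRow = Struct.new(:date, :views, :unique_visitors, keyword_init: true) do
  # Still returns one [timestamp_in_ms, value] pair, but the caller chooses
  # the column. Defaulting to :views keeps existing call sites working.
  def to_flot(metric = :views)
    raise ArgumentError, "unknown metric: #{metric}" unless FLOT_METRICS.include?(metric)

    [date.to_i * 1000, public_send(metric)]
  end
end

row = MultiMetricRow.new(date: Time.utc(2018, 2, 26), views: 5, unique_visitors: 4)

views_pair    = row.to_flot
visitors_pair = row.to_flot(:unique_visitors)
```

Keeping a default argument is one way to limit churn in existing presenters; whether that is desirable depends on how the rest of the execution path is reworked.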

Merge/Update existing tables

Pros:

If done properly, the migration path may be slightly less involved for end users. This seems particularly true for the work_view_stats table becoming a single table to hold all work metrics.
Allows for a potentially more ideal database model that more accurately maps to the needs of the front end query system
Should be more performant (how much is unknown)
ActiveRecord has good support for renaming tables via the rename_table transformation.

Cons:

There are already two file statistics tables. So there is the question of which to merge into, and whether ultimately that is better, or any different, than just creating a new table.
The Hyrax::Statistic code will still need to change to not rely on a single column for data, such as the to_flot execution path
Possibly a complicated migration path for users; it would require thorough testing and very well-documented migration instructions.