The current schema of the DW optimizes for reporting on the state of a student at the time they performed work rather than their current state. This is primarly caused by the decision to model most of our dimensions as Type 2 SCDs. As students change or move, we write additional rows to the student_dimensions table to represent the new state of that student and leave the original row untouched. For example:
student_dimensions
id | source_id | name | is_active | is_latest | effective_date |
---|---|---|---|---|---|
1 | 7 | Joel | true | false | 08-01-2015 |
2 | 7 | Joel M | true | false | 08-09-2015 |
3 | 7 | Joel M | false | true | 08-25-2015 |
Here, a single student with apangea ID 7 has been modified three times. They were created on 8/1, had their name changed on 8/9, and were deactivated on 8/25. We have an audit log, but let's see how this makes some seemingly simply queries difficult when we include a fact table.
lesson_completed_facts
student_id | lesson_id | post_quiz_score |
---|---|---|
1 | 10 | 0.85 |
2 | 17 | 0.67 |
2 | 77 | 1.0 |
Here, our student has finished 3 lessons. One was done when the first version of the student dimension record was current, and the other two were completed after the name change (because it is using student_id 2 instead of student_id 1).
Now, let's say we want to run simple query to see how many lessons have been completed by active students. Naively, we'd write the following query:
SELECT
count(*)
FROM
lesson_completed_facts
join student_dimensions on student_dimenions.id = lesson_completed_facts.student_id
WHERE
student_dimensions.is_active = true
According to our reporting guidelines, we want this query to return nothing, since our only student is currently inactive. However, the answer we get is 3! Why? Because we're joining to versions of the student that were active at the time the lesson was completed instead of the most recent version of the student.
To compensate, the warehouse currently goes to great lenghts to compensate for this by doing something like this:
SELECT
count(*)
FROM
lesson_completed_facts
join student_dimensions on student_dimenions.id = lesson_completed_facts.student_id
join student_dimensions latest_student on latest_student.source_id = student_dimensions.source_id and latest_student.is_latest
WHERE
latest_student.is_active = true
In addition to being complex, this is also slow. Also, this isn't applied consistently in all queries because the developer has to remember to do it. We can improve this situtation have two tables for students the original Type 2 SCD, and a new Type 1 SCD that only contains the most recent version of each record. Then, if the facts were to contain both the historical id and the source id, most queries would work as intended without having to think about this is_latest complexity. Extending the example above:
If we were to rename student_dimensions above to historical_student_dimensions and created a new table that looked like:
student_dimensions
id | name | is_active |
---|---|---|
7 | Joel M | false |
And also change the fact table to look like:
lesson_completed_facts
historical_student_id | student_id | lesson_id | post_quiz_score |
---|---|---|---|
1 | 7 | 8 | 0.85 |
2 | 7 | 17 | 0.67 |
2 | 7 | 77 | 1.0 |
Then our original query:
SELECT
count(*)
FROM
lesson_completed_facts
join student_dimensions on student_dimenions.id = lesson_completed_facts.student_id
WHERE
student_dimensions.is_active = true
would correctly return 0 rows, but yet we still preserve the historical student context if it's needed.
Dimensions affected:
- customer_dimensions
- district_dimensions
- school_dimensions
- classroom_dimensions
- student_dimensions
- teacher_dimensions
- donateable_dimensions *
- item_dimensions *
- pathway_dimensions *
- standard_dimensions *
Changes to existing tables:
- Rename existing dimensions to be prefixed with historical_
- Continue to update historical_ tables as type 2 SCDs via ETL.
New Tables:
- Generate new tables that are duplicates of the historical_ tables except they don't have is_latest, effective_start, effective_end columns
- The primary key is the apangea source id, not the surrogate id
- Updated as a Type 1 SCD via ETL
Incremental ideas:
- Include is_latest=true on all non-historical tables to begin with so the queries don't know the difference.
- Have feature flag that swaps which version of the dimensions we join to: historical or regular
Fact changes:
- Rename existing dimension columns on all facts to have historical_ prefix
- Add columns columns to all facts that contain source ids for all these dimensions
Query changes:
- Never join to historical records ever.
- Remove all is_latest stuff, everwhere.
- Remove double joins to get latest student records for filtering
Asterisk tables:
I question the value of keeping history on the dimensions marked with asterisks above, since their history doesn't directly contribute to any known requirement. If warranted, we could remove the Type 2 historical table for those entities entirely and just directly update the Type 1 table via ETL.
Apangea needs to explicitly capture the following move events:
- Students moving between classrooms (which might be in another school, district)
- Update classroom_id, school_id, district_id on all affected facts
- Classrooms moving across schools (which might be in another district)
- Update school_id, district_id on all affected facts
- Schools moving across districts (which might be to no district at all)
- Update district_id on all affected facts
A simple UI should be developed in apangea so that the move history of a student/classroom/school/district can be seen by the dev and support teams.
The ETL process needs consume new apangea "entity moved" events, and update all related facts accordingly.
- Students moving between classrooms
- Update classroom_id, school_id, district_id on all affected facts
- Classrooms moving across schools
- Update school_id, district_id on all affected facts
- Schools moving across districts
- Update district_id on all affected facts
Tables affected: attempt_facts attempt_by_date_student_classoom_aggregates donation_facts (???? do we want classroom goal donations to move?) hint_view_facts lesson_completed_facts live_help_facts placement_attempt_facts placement_test_completed_facts
The aggregate tables are rebuilt every night, and so we should explain that moves will not be reflected in aggregate until the next school day.
In order to support the guiding principle that students with work over the configured date range should be denominator for all averages, we need an easy way to see if a student has completed a single attempt, placement attempt, or benchmark attempt in that date range.
We already have a daily rollup of the attempt_fact table that gets us close. If we extended it to include placement attempts and benchmark attempts a simple check to see if a row exists in that table over the configured date range for the student in question would answer the question.
- Extend attempt_by_date_student_classoom_aggregates to include the number of placement attempts, benchmark attempts, and lessons completed each day.
- Rename attempt_by_date_student_classoom_aggregates to student_work_by_date (or something)
- We might be able to merge this concept with the PointLeaderboard aggregate table?
- Update overview report query to use this table for the students_with_work count and use it as the denominator in all averages in the overview report.
- Need to do more research to see where else we show averages to see what else will be affected.