hrwgc/kenya_kcse_processing_part1.md Secret

## kenya_kcse_processing_part1.md

      
    Raw
  

              kenya_kcse_processing_part1.md
            
          
    Final Numbers


3,235 schools matched to location information from 2007 and KCSE exam performance information from  2010
Approximately 35,000 unique records (student performance by gender and grade each had separate rows, up to 15+ per school) condensed into 3,235 rows, with records transposed into columns.

Starting Point:

Kenya Grades Database


5,946 unique schools reporting grades for 2010
64,949 unique rows needing to be joined
school_name column contains 7972 unique entries total;
5,946 unique schools reporting grades for 2010
counties, districts, across two databases  not necessarily compatible due to changes since 2007, when schools database was created.
unique identifying school codes, such as "MOE code" and "knec code", in both databases, but no code is in both databases.

Clean Join


The clean join was made using the clean_name_1, clean_name_2, and clean_name_3 columns from the two different databases.
Initial attempts to join based on contents of "name of school" "school name" columns yielded roughly an 8% success rate.
The clean join method joined 1,740 unique records (of the total 5,946 for 2010).

Steps to Achieve Clean Join


Use spreadsheet application and copy/paste the contents of each databases' school name column into a text editor.


Use regular expressions to remove inconsistencies that do not affect meaning.


Some of the main problem areas that I resolved through regular expression find/replace included standardizing capitalization, spacing, use of punctuation.


I processed the school names five different times, increasing the amount of modifications I made with each subsequent processing. There are an additional 6 columns in each database with the new school name for that stage.


Kenya Grades


column name
meaning
original/modification


g_school_name
school name
original


g_clean_name_1
first standardization of school name
modified


g_clean_name_2
second standardization of school name
modified


g_clean_name_3
third standardization of school name
modified


g_clean_name_4
fourth standardization of school name - contains less frequently used terms
modified


g_clean_name_42
fourth standardization of school name - contains most frequently used terms
modified


g_clean_name_5
further refined version of s_clean_name_42
modified


Kenya Schools


column name
meaning
original/modification


s_name_of_school
name of school
original


s_clean_name_1
first standardization of school name
modified


s_clean_name_2
second standardization of school name
modified


s_clean_name_3
third standardization of school name
modified


s_clean_name_4
fourth standardization of school name - contains less frequently used terms
modified


s_clean_name_42
fourth standardization of school name - contains most frequently used terms
modified


s_clean_name_5
further refined version of s_clean_name_42


See full data dictionary here
Join 2


I identified key terms used so frequently as to have lost distinguishing power, but which were preventing joins from happening because of word order. Join 2 resulted from isolating the extra words from school names.


At first attempt, this join connected an additional 4,141 unique records, or a total of 48,877 rows.


This join is not as precise or accurate as the first one.


Best Join Solution


The script I wrote to initiate the join in MySQL is available as a gist

Here is an excerpt from it
LEFT OUTER JOIN kenya_schools ON ( kenya_grades.`g_year` = 2010
    AND ((kenya_grades.`g_school_name` LIKE kenya_schools.`s_name_of_school`)
        OR (kenya_grades.`g_school_name` LIKE kenya_schools.`s_clean_name_1`)
        OR (kenya_grades.`g_school_name` LIKE kenya_schools.`s_clean_name_2`)
        OR (kenya_grades.`g_school_name` LIKE kenya_schools.`s_clean_name_3`)
        OR (kenya_grades.`g_school_name` LIKE kenya_schools.`s_clean_name_5`))
    ....

The script cycles through the versions of the school names available in database 1, attempting to join it to an equivalent one from the second database. Because it is a left outer join, the school grade rows that are not joined are still added to the new table, so we can identify them.
After the join - preparing database for mapping in TileMill

Once the grades were joined to their appropriate schools from the 2007 database, the structure of the resulting table needed to be reformatted to be able to show aggregate student performance. See the scripts here
column name	meaning	original/modification
g_school_name	school name	original
g_clean_name_1	first standardization of school name	modified
g_clean_name_2	second standardization of school name	modified
g_clean_name_3	third standardization of school name	modified
g_clean_name_4	fourth standardization of school name - contains less frequently used terms	modified
g_clean_name_42	fourth standardization of school name - contains most frequently used terms	modified
g_clean_name_5	further refined version of s_clean_name_42	modified
column name	meaning	original/modification
s_name_of_school	name of school	original
s_clean_name_1	first standardization of school name	modified
s_clean_name_2	second standardization of school name	modified
s_clean_name_3	third standardization of school name	modified
s_clean_name_4	fourth standardization of school name - contains less frequently used terms	modified
s_clean_name_42	fourth standardization of school name - contains most frequently used terms	modified
s_clean_name_5	further refined version of s_clean_name_42