Skip to content

Instantly share code, notes, and snippets.

@hrwgc
Created June 29, 2012 19:28
Show Gist options
  • Save hrwgc/efc4b282ac5c80e4500b to your computer and use it in GitHub Desktop.
Save hrwgc/efc4b282ac5c80e4500b to your computer and use it in GitHub Desktop.
Data Processing for Kenyan KCSE Databases

Final Numbers

  • 3,235 schools matched to location information from 2007 and KCSE exam performance information from 2010
  • Approximately 35,000 unique records (student performance by gender and grade each had separate rows, up to 15+ per school) condensed into 3,235 rows, with records transposed into columns.

Starting Point:

Kenya Grades Database

  • 5,946 unique schools reporting grades for 2010
  • 64,949 unique rows needing to be joined
  • school_name column contains 7972 unique entries total;
  • 5,946 unique schools reporting grades for 2010
  • counties, districts, across two databases not necessarily compatible due to changes since 2007, when schools database was created.
  • unique identifying school codes, such as "MOE code" and "knec code", in both databases, but no code is in both databases.

Clean Join

  • The clean join was made using the clean_name_1, clean_name_2, and clean_name_3 columns from the two different databases.
  • Initial attempts to join based on contents of "name of school" "school name" columns yielded roughly an 8% success rate.
  • The clean join method joined 1,740 unique records (of the total 5,946 for 2010).

Steps to Achieve Clean Join

  • Use spreadsheet application and copy/paste the contents of each databases' school name column into a text editor.

  • Use regular expressions to remove inconsistencies that do not affect meaning.

  • Some of the main problem areas that I resolved through regular expression find/replace included standardizing capitalization, spacing, use of punctuation.

  • I processed the school names five different times, increasing the amount of modifications I made with each subsequent processing. There are an additional 6 columns in each database with the new school name for that stage.

Kenya Grades

column name meaning original/modification
g_school_name school name original
g_clean_name_1 first standardization of school name modified
g_clean_name_2 second standardization of school name modified
g_clean_name_3 third standardization of school name modified
g_clean_name_4 fourth standardization of school name - contains less frequently used terms modified
g_clean_name_42 fourth standardization of school name - contains most frequently used terms modified
g_clean_name_5 further refined version of s_clean_name_42 modified

Kenya Schools

column name meaning original/modification
s_name_of_school name of school original
s_clean_name_1 first standardization of school name modified
s_clean_name_2 second standardization of school name modified
s_clean_name_3 third standardization of school name modified
s_clean_name_4 fourth standardization of school name - contains less frequently used terms modified
s_clean_name_42 fourth standardization of school name - contains most frequently used terms modified
s_clean_name_5 further refined version of s_clean_name_42

cleaning of names

See full data dictionary here

Join 2

  • I identified key terms used so frequently as to have lost distinguishing power, but which were preventing joins from happening because of word order. Join 2 resulted from isolating the extra words from school names.

  • At first attempt, this join connected an additional 4,141 unique records, or a total of 48,877 rows.

  • This join is not as precise or accurate as the first one.

Best Join Solution

Here is an excerpt from it

LEFT OUTER JOIN kenya_schools ON ( kenya_grades.`g_year` = 2010
    AND ((kenya_grades.`g_school_name` LIKE kenya_schools.`s_name_of_school`)
        OR (kenya_grades.`g_school_name` LIKE kenya_schools.`s_clean_name_1`)
        OR (kenya_grades.`g_school_name` LIKE kenya_schools.`s_clean_name_2`)
        OR (kenya_grades.`g_school_name` LIKE kenya_schools.`s_clean_name_3`)
        OR (kenya_grades.`g_school_name` LIKE kenya_schools.`s_clean_name_5`))
    ....

The script cycles through the versions of the school names available in database 1, attempting to join it to an equivalent one from the second database. Because it is a left outer join, the school grade rows that are not joined are still added to the new table, so we can identify them.

After the join - preparing database for mapping in TileMill

Once the grades were joined to their appropriate schools from the 2007 database, the structure of the resulting table needed to be reformatted to be able to show aggregate student performance. See the scripts here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment