Skip to content

Instantly share code, notes, and snippets.

@konstantinstadler
Created April 9, 2021 12:07
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save konstantinstadler/dc3583a4674a39def0d4611c095eb788 to your computer and use it in GitHub Desktop.
Save konstantinstadler/dc3583a4674a39def0d4611c095eb788 to your computer and use it in GitHub Desktop.
Example script for how to parse data to add a new classification column to the country converter coco: https://github.com/konstantinstadler/country_converter
""" Parse Global Health Data Exchange (GHDx) / Global Burden of Disease (GBD) numeric country codes for coco
This needs only to be done once, but might be a good guide for other inputs as well
Data sources:
- GHDx: http://ghdx.healthdata.org/
- Codebook with country codes: ghdx.healthdata.org/sites/default/files/ihme_query_tool/IHME_GBD_2019_CODEBOOK.zip
"""
import pandas as pd
import country_converter
data = pd.read_excel('./IHME_GBD_2019_ALL_LOCATIONS_HIERARCHIES_Y2020M10D15.XLSX', sheet_name='Sheet1', header=0, engine='openpyxl')
col_new_code = 'Location ID'
col_country_names = 'Location Name'
data.drop_duplicates(subset=[col_new_code, col_country_names], inplace=True)
coco = country_converter.CountryConverter(include_obsolete=True)
data.loc[:, 'converted'] = data.loc[:, col_country_names].apply(coco.convert, src='regex', to='name_short', not_found='not_found')
converted_with_duplicates = data[data.converted != 'not_found']
# Results
# Make sure to deal with the duplicates
found_duplicates = converted_with_duplicates[converted_with_duplicates.loc[:, 'converted'].duplicated(keep=False)]
# The country codes sorted in based on the data of the country converter
result_removed_duplicates = converted_with_duplicates.drop_duplicates(subset=['converted']).set_index('converted', drop=True).reindex(coco.data.name_short).fillna('')
# save results
xlsxwriter = pd.ExcelWriter('converted.xlsx', engine='openpyxl')
found_duplicates.to_excel(xlsxwriter, sheet_name='duplicates')
result_removed_duplicates.to_excel(xlsxwriter, sheet_name='sorted_results')
xlsxwriter.save()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment