Scraping data on Covid-19 in Malaysia from DG of Health daily press release

About

I find it tedious to manually update and maintain Excel files of the Covid-19 situation in Malaysia. The Director General of Health releases a daily press statement on the website, mostly in text format (only some information is in tables): https://kpkesihatan.com/

For data analysis, data in a structured format is preferred, but this is not readily available from the source. Thus, the data must be updated manually every day from the website or by listening to the press conference.

In this Gist, I share a method to scrape data from the daily update web page using R.

You may also have a look at my scraped data sets and updated scripts at https://github.com/wnarifin/covid-19-malaysia.

Required packages

library(rvest)    # scrape and parse the html page
library(stringr)  # string handling
library(xlsx)     # write xls files
library(readxl)   # read xls files

Getting the URL

I automate the URL generation to match today's date, based on the patterns of the URL:

my_date = Sys.Date()
my_day = format(as.Date(my_date), "%d")
my_day_no = as.numeric(my_day)
my_mo = format(as.Date(my_date), "%m")
my_mo_no = as.numeric(my_mo)
my_mo_list = c("januari", "februari", "mac", "april", "mei", "jun", "julai", "ogos", "september", "oktober", "november", "disember")  # month names in Malay, as used in the URL
# two URL patterns are in use, with and without "kpk-"
kpk_url = paste0("https://kpkesihatan.com/2020/", my_mo, "/", my_day, "/kenyataan-akhbar-kpk-", my_day_no,
                 "-", my_mo_list[my_mo_no], "-2020-situasi-semasa-jangkitan-penyakit-coronavirus-2019-covid-19-di-malaysia/")
kpk_url1 = paste0("https://kpkesihatan.com/2020/", my_mo, "/", my_day, "/kenyataan-akhbar-", my_day_no,
                  "-", my_mo_list[my_mo_no], "-2020-situasi-semasa-jangkitan-penyakit-coronavirus-2019-covid-19-di-malaysia/")

Read the page, trying each of the URLs

kpk_page = try(read_html(kpk_url), silent = TRUE)
# fall back to the second URL pattern if the first one fails
if (inherits(kpk_page, "try-error")) kpk_page = try(read_html(kpk_url1), silent = TRUE)
str(kpk_page)  # make sure the html page is loaded, not a try-error

Read the table

Towards the end of the page, there are two tables detailing the daily new cases by state: table 1 is in Malay, table 2 in English. Sometimes the English one is not available, so it is safer to stick with table 1. The table also gives the cumulative total cases for each state, which we can use to get new cases.

my_cases = html_nodes(kpk_page, "table")[1]  # table 1, the Malay one
my_table = html_table(my_cases, fill = T, header = T)
my_table = as.data.frame(my_table)
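As a cross-check on the new-cases column, one could difference today's cumulative total against yesterday's stored total. A minimal sketch, where prev_total is a hypothetical value taken from the previously saved spreadsheet:

# sketch: derive new cases from the cumulative totals
prev_total = 5000  # hypothetical: yesterday's national total, e.g. read back from the saved xls
today_total = as.numeric(str_remove_all(my_table[17, 3], ","))  # cumulative total in the national row
new_cases_check = today_total - prev_total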

Read the text

This is where things get more challenging. We need to obtain the numbers of new recoveries, ICU cases, cases on respiratory support, and deaths.

my_text = html_nodes(kpk_page, "p")
html_text(my_text)  # inspect the paragraph texts

Get the numbers for whole Malaysia

New cases

new_cases = my_table[17, 2]  # more reliable from the table; row 17 is the national total row
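Hard-coding row 17 breaks if the table ever gains or loses a row. A more defensive sketch, assuming the national total row label contains the word "JUMLAH" (Malay for total), looks the row up by name instead:

# sketch: locate the total row by its label instead of a fixed index
loc_total = grep("jumlah", my_table[, 1], ignore.case = TRUE)
new_cases = as.numeric(str_remove_all(my_table[loc_total[1], 2], ","))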

Recover

loc = grep("kes yang telah pulih", html_text(my_text), ignore.case = T)
pulih = html_nodes(my_text[loc], "strong")[1]
recover = as.numeric(word(html_text(pulih, trim = T)))
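word() takes the first whitespace-delimited token, which fails if the bolded text does not start with the number or if the number carries a thousands separator. A hedged alternative pulls the digits out directly:

# sketch: extract the first run of digits (allowing comma separators), then strip the commas
recover = as.numeric(str_remove_all(str_extract(html_text(pulih), "[0-9,]+"), ","))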

ICU and respiratory support

str_split(html_text(my_text[4]), " ")  # inspect the sentence structure first

loc = grep("sedang dirawat di Unit Rawatan Rapi", html_text(my_text), ignore.case = T)
urr = html_nodes(my_text[loc], "strong")[1]
icu = as.numeric(word(html_text(urr, trim = T)))

loc = grep("kes memerlukan bantuan pernafasan", my_text, ignore.case = T)
bantuan = grep("bantuan", str_split(html_text(my_text[loc]), " ", simplify = T))
support = as.numeric(word(html_text(my_text[loc]), bantuan - 3))

Death

For deaths, we need to find "Kes kematian" (death case) among the list items in the text, and use the right ones.

my_li = html_nodes(kpk_page, "li")
loc = grep("Kes kematian", html_text(my_li))  # list items announcing a death

new_deaths = length(loc)  # each matched item is one death
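It is worth eyeballing the matched items before trusting the count:

html_text(my_li[loc])  # each element should read as one death announcement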

Data frame for daily update for whole Malaysia

data_all = data.frame(date=my_date, location="Malaysia", new_cases=new_cases, new_deaths=new_deaths, 
                      total_cases=NA, total_deaths=NA, recover=recover, total_recover=NA, 
                      icu=icu, support=support)
data_all

Update existing xls [whole Malaysia]

I keep the data in an Excel file, named "covid-19_my_full.xls", which I update daily before reading it into R.

# read the preexisting xls first, then append the new row to the existing dataframe
data_temp = read_xls("covid-19_my_full.xls")
data_temp = rbind(data_temp, data_all)
data_temp = as.data.frame(data_temp)

# write to xls, change to your file name
write.xlsx2(data_temp, "covid-19_my_full.xls", sheetName = "main", showNA = F, row.names = F)
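On the very first run the file does not exist yet, so read_xls would fail. A sketch guarding against that, using the same file name:

# sketch: create the file on the first run, append on later runs
if (!file.exists("covid-19_my_full.xls")) {
  write.xlsx2(data_all, "covid-19_my_full.xls", sheetName = "main", showNA = F, row.names = F)
} else {
  data_temp = as.data.frame(rbind(read_xls("covid-19_my_full.xls"), data_all))
  write.xlsx2(data_temp, "covid-19_my_full.xls", sheetName = "main", showNA = F, row.names = F)
}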

Get the numbers for each state in Malaysia

Now that we are done with the data for the whole of Malaysia, we want to get the numbers for each state.

New cases

This is easily obtained from the table my_table.

colnames(my_table) = c("state", "new_cases", "total_cases")
my_table[,2] = as.numeric(str_remove_all(my_table[,2], ","))  # remove ","
my_table[,3] = as.numeric(str_remove_all(my_table[,3], ","))  # remove ","
data_state = my_table

Deaths

This is the most difficult part of the web scraping. The place of death is not consistently mentioned in the text. Sometimes the name of the hospital is mentioned together with the name of the state; sometimes only the name of the hospital is mentioned. Thus this part can be improved over time. For the basic block, we will again use the my_li that we obtained earlier.

# deaths by state [trial]
negeri = c("Perlis", "Kedah", "Pulau Pinang", "Perak", "Selangor", "Negeri Sembilan", "Melaka", "Johor", "Pahang", "Terengganu", "Kelantan", "Sabah", "Sarawak", "Kuala Lumpur", "Putrajaya", "Labuan")
negeri_text = str_flatten(html_text(my_li[loc]))  # all death announcements as one string
# replacements, add more to the replacement list over time
negeri_text = str_replace(negeri_text, "Universiti Malaya", "Kuala Lumpur")
negeri_text = str_replace(negeri_text, "Hospital Sungai Buloh, Selangor", "-, Selangor")
negeri_text = str_replace(negeri_text, "Hospital Sungai Buloh", "Selangor")
negeri_text = str_replace(negeri_text, "Hospital Selayang, Selangor", "-, Selangor")
negeri_text = str_replace(negeri_text, "Hospital Selayang", "Selangor")
negeri_text = str_replace(negeri_text, "Kuching, Sarawak", "-")
negeri_text = str_replace(negeri_text, "daripada Perlis", "-")
# count the mentions of each state
new_deaths_state = str_count(negeri_text, negeri); new_deaths_state
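str_count is vectorised over negeri, so pairing the two vectors makes the result easy to eyeball:

data.frame(state = negeri, new_deaths = new_deaths_state)  # quick visual check per state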

Known issues with the method:

• state name not mentioned at all.
• multiple mentions of a state name in the text.

Now add the death count to data_state:

data_state$new_deaths = c(new_deaths_state, sum(new_deaths_state))  # per state, then the national total
data_state

Update existing xls [state]

I keep the data in an Excel file, named "covid-19_my_state.xls", which I update daily before reading it into R. I write the newly fetched data into a new sheet, named in %Y%m%d format, e.g. 20200415. This changes to today's date on each run.

write.xlsx2(data_state, "covid-19_my_state.xls", sheetName = paste0(format(as.Date(my_date), "%Y%m%d")), append = T, showNA = F, row.names = F)
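readxl's excel_sheets() can confirm that the new date-named sheet was appended:

excel_sheets("covid-19_my_state.xls")  # the last sheet should be today's date, e.g. "20200415"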

Conclusion

This method makes the process of getting new data easier, without having to read the press release in detail. It can be easily integrated into any analysis workflow in R.

@DrJohan commented Sep 28, 2020

Great work!!!!!

@drxyzw commented Nov 23, 2020

Thank you for the helpful tips!
I used to fetch COVID-19 numbers by state from Wikipedia, but it's gone.
Based on the tips here, I created a Python version to fetch data (only new/cumulative cases by state, though):
https://github.com/drxyzw/WuhanCoronavirus-Malaysia/blob/master/dataFromHealthDirectorBlog.py

@wnarifin (Author) commented Nov 25, 2020

@DrJohan @drxyzw
You're welcome. You may also have a look at https://github.com/wnarifin/covid-19-malaysia. I also include a scripts folder with updated code, though some of it is quite messy from adapting to changes in the kpkesihatan webpage. I also wrote a related new gist here.

@profjamal

Great work Arifin! Do you have data on clusters?

@wnarifin (Author)

@profjamal
No, I don't, because the info is not reported in a consistent format in the daily press release (sometimes in text, tables, and even video), making it quite difficult to scrape the data.

@ukkdosm commented Jan 29, 2021

@drxyzw [python - automate]
@wnarifin [R]

Great work everyone! It really helped me a lot.
