@wnarifin
Last active November 30, 2020 14:31
Web scraping the number of COVID-19 daily recovered cases in Malaysia by OCR method

About

The Director General of Health Malaysia releases a daily press statement on the website at https://kpkesihatan.com/. However, the number of recovered cases by state is not given in table format, so it cannot be scraped directly from the page in the way detailed in my other gist.

Instead, the number of recovered cases is presented as an image, so I describe here how to use OCR to get the daily number of recovered cases by state using R. The example presented here is based on the press statement of 24-11-2020.

You may also have a look at my scraped data sets and updated scripts at https://github.com/wnarifin/covid-19-malaysia.

Required Libraries

library(rvest)
library(tesseract)
library(magick)
library(magrittr)
library(stringr)

Set the date and URL

# Date
my_date = Sys.Date()
# my_date = "2020-11-01"  # to use another date, in yyyy-mm-dd format (the URL below is hard-coded for 2020)
my_day = format(as.Date(my_date), "%d")
my_day_no = as.numeric(my_day)
my_mo = format(as.Date(my_date), "%m")
my_mo_no = as.numeric(my_mo)
# Set URL
my_mo_list = c("januari", "februari", "mac", "april", "mei", "jun", "julai", "ogos", "september", "oktober", "november", "disember")
kpk_url = paste0("https://kpkesihatan.com/2020/", my_mo, "/", my_day, "/kenyataan-akhbar-kpk-", my_day_no,
                 "-", my_mo_list[my_mo_no], "-2020-situasi-semasa-jangkitan-penyakit-coronavirus-2019-covid-19-di-malaysia/")

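The URL construction above can also be wrapped in a small function, which makes it easier to loop over several dates. This is only a sketch: `build_kpk_url` is a hypothetical helper name, and taking the year from the date assumes the same URL pattern holds outside 2020, which the original script does not confirm.

```r
# Sketch: build the press statement URL for a given date (base R only).
# build_kpk_url is a hypothetical helper, not part of the original script.
build_kpk_url = function(date) {
  date = as.Date(date)
  mo_list = c("januari", "februari", "mac", "april", "mei", "jun", "julai",
              "ogos", "september", "oktober", "november", "disember")
  day = format(date, "%d")  # zero-padded day, used in the path
  mo = format(date, "%m")   # zero-padded month, used in the path
  yr = format(date, "%Y")   # ASSUMPTION: same URL pattern for years other than 2020
  paste0("https://kpkesihatan.com/", yr, "/", mo, "/", day,
         "/kenyataan-akhbar-kpk-", as.integer(day), "-", mo_list[as.integer(mo)], "-", yr,
         "-situasi-semasa-jangkitan-penyakit-coronavirus-2019-covid-19-di-malaysia/")
}

build_kpk_url("2020-11-24")
```

For 2020 dates this gives the same string as the `paste0()` call above.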
Read the page

kpk_page = read_html(kpk_url)
str(kpk_page)  # make sure HTML page is loaded
## List of 2
##  $ node:<externalptr> 
##  $ doc :<externalptr> 
##  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
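Because `my_date` defaults to today, the statement may not be published yet, in which case reading the page fails with an error. A minimal sketch of a guard, assuming you prefer a `NULL` result over a stopped script; `read_page_or_null` and its `reader` argument are hypothetical names introduced here for illustration.

```r
# Sketch: return NULL instead of stopping when the page cannot be read.
# read_page_or_null is a hypothetical helper; reader defaults to xml2::read_html,
# which is the function rvest re-exports as read_html.
read_page_or_null = function(url, reader = xml2::read_html) {
  tryCatch(
    reader(url),
    error = function(e) {
      message("Could not read ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Usage (needs network access, so it is left commented out):
# kpk_page = read_page_or_null(kpk_url)
# if (is.null(kpk_page)) stop("Press statement not available for this date")
```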

Get and read the image for the daily number of recovered cases

img_node = html_nodes(kpk_page, "img")
img_loc = grep("discaj", img_node, ignore.case = T)  # get node with discaj
img_link = html_attr(img_node[img_loc], "data-orig-file")  # get the content of attribute in a tag
img_data = image_read(img_link)
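The `grep()` step assumes exactly one image on the page mentions "discaj". If the page layout changes, it could match none or several. A small check like the one below may help; `pick_single_match` is a hypothetical helper, and the two `<img>` tags are made-up stand-ins for `as.character(img_node)`.

```r
# Sketch: make sure the pattern matches exactly one candidate before using it.
# pick_single_match is a hypothetical helper, not part of rvest.
pick_single_match = function(pattern, candidates) {
  loc = grep(pattern, as.character(candidates), ignore.case = TRUE)
  if (length(loc) != 1) {
    stop("Expected exactly 1 match for '", pattern, "', found ", length(loc))
  }
  loc
}

# Made-up <img> tags for illustration only
tags = c('<img src="logo.png">',
         '<img data-orig-file="https://example.com/kes-discaj.png">')
pick_single_match("discaj", tags)  # index of the second tag
```

In the script above, `img_loc = pick_single_match("discaj", img_node)` would be the equivalent call.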

Read for one state, e.g. Kelantan

For Kelantan, the crop region is 80x22 pixels with its upper-left corner at pixel (200, 348), measured after scaling the image to 794x446. Change these values for other states.

state = "KELANTAN"
img_data_kelantan = img_data %>% image_scale("794x446") %>% image_crop("80x22+200+348") %>% image_resize("2000x")
# enlarge the image, works better with OCR
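The crop string follows the `WIDTHxHEIGHT+X+Y` geometry format. If you plan to crop several states, it may be tidier to keep the numbers in a list and build the string from them. This is a sketch: `crop_geometry` and `crop_regions` are hypothetical names, and only the Kelantan values come from the example above. Regions for other states would have to be measured on the scaled image.

```r
# Sketch: build a "WxH+X+Y" geometry string from numbers (base R only).
# crop_geometry and crop_regions are hypothetical names for illustration.
crop_geometry = function(width, height, x, y) {
  sprintf("%dx%d+%d+%d", as.integer(width), as.integer(height),
          as.integer(x), as.integer(y))
}

# Only Kelantan is known from the example; add other states after measuring them
crop_regions = list(
  KELANTAN = c(width = 80, height = 22, x = 200, y = 348)
)

kel = crop_regions[["KELANTAN"]]
crop_geometry(kel[["width"]], kel[["height"]], kel[["x"]], kel[["y"]])
```

The resulting string can be passed to `image_crop()` in place of the hard-coded `"80x22+200+348"`.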

OCR and get the number

recover_data = image_ocr(img_data_kelantan, language = "msa") %>% str_extract_all("[:digit:]", simplify = T) %>%
  str_c(collapse = "") %>% as.numeric()
data_recover_kelantan = data.frame(date=my_date, state=state, recover=recover_data)
data_recover_kelantan
##         date    state recover
## 1 2020-11-24 KELANTAN       4
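The digit-extraction step keeps only the digits in the OCR text and joins them into one number. The sketch below shows the same idea in base R (using `gsub()` instead of the stringr calls above) and makes the no-digit case explicit. `parse_count` is a hypothetical helper, and the input strings are made-up examples of what OCR might return.

```r
# Sketch: keep digits only, then convert to a number (base R, not stringr).
# parse_count is a hypothetical helper for illustration.
parse_count = function(ocr_text) {
  digits = gsub("[^0-9]", "", ocr_text)     # drop everything that is not a digit
  if (nchar(digits) == 0) return(NA_real_)  # OCR found no digits at all
  as.numeric(digits)
}

parse_count("4 kes\n")  # made-up OCR text with a unit and a newline
parse_count("1,234")    # a thousands separator is dropped as well
parse_count("")         # no digits, returns NA
```

Note that any stray digit picked up by the OCR ends up in the number, so it is worth checking the result against the image.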

Read data for all states, then OCR and get the numbers

# Read image data
img_data_state = img_data %>% image_resize("2000x") %>% image_enhance() %>% image_modulate(brightness = 130)
# OCR
recover_img = image_ocr(img_data_state, language = "msa")
recover_data = str_split(recover_img, "[\n]", simplify = T)  # split at \n
recover_data = recover_data[grep("kes", recover_data)]  # extract index with kes
recover_data = str_c(recover_data, collapse = " ")
recover_data_state = as.numeric(str_split(recover_data, "kes", simplify = T)[1:15])
# Supposed to be 16, but WP KL & WP PUTRAJAYA combined in the image
state_all = c("Perlis", "Kedah", "Pulau Pinang", "Perak", "Selangor",
              "WP Kuala Lumpur/Putrajaya", "Negeri Sembilan", "Melaka", "Johor", "Pahang",
              "Terengganu", "Kelantan", "Sabah", "Sarawak", "WP Labuan")
state_all = str_to_upper(state_all)
data_recover_state = data.frame(date=rep(my_date,length(recover_data_state)), state=state_all, recover=recover_data_state)
data_recover_state
##          date                     state recover
## 1  2020-11-24                    PERLIS       1
## 2  2020-11-24                     KEDAH      45
## 3  2020-11-24              PULAU PINANG      34
## 4  2020-11-24                     PERAK       7
## 5  2020-11-24                  SELANGOR     465
## 6  2020-11-24 WP KUALA LUMPUR/PUTRAJAYA     108
## 7  2020-11-24           NEGERI SEMBILAN     143
## 8  2020-11-24                    MELAKA       4
## 9  2020-11-24                     JOHOR      10
## 10 2020-11-24                    PAHANG       0
## 11 2020-11-24                TERENGGANU       4
## 12 2020-11-24                  KELANTAN       4
## 13 2020-11-24                     SABAH     833
## 14 2020-11-24                   SARAWAK      10
## 15 2020-11-24                 WP LABUAN       5
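The split-on-"kes" logic relies on the OCR returning every count cleanly. If a number is misread, the result silently contains `NA` or shifts to the wrong state. Below is a sketch of the same steps in base R with a check added; `parse_state_counts` is a hypothetical helper and `sample_text` is a made-up OCR string, not real output.

```r
# Sketch: split OCR text on "kes" and check that enough counts were found.
# parse_state_counts is a hypothetical helper; base R replaces the stringr calls.
parse_state_counts = function(ocr_text, n_states) {
  lines = strsplit(ocr_text, "\n", fixed = TRUE)[[1]]    # split at \n
  lines = grep("kes", lines, value = TRUE)               # keep lines containing "kes"
  joined = paste(lines, collapse = " ")
  parts = strsplit(joined, "kes", fixed = TRUE)[[1]]     # text in front of each "kes"
  counts = suppressWarnings(as.numeric(trimws(parts)))   # non-numeric pieces become NA
  counts = counts[seq_len(n_states)]
  if (anyNA(counts)) {
    stop("Expected ", n_states, " counts, got ", sum(!is.na(counts)), " usable values")
  }
  counts
}

# Made-up OCR text for illustration only
sample_text = "header\n1 kes 45 kes\n34 kes\nfooter"
parse_state_counts(sample_text, 3)
```

With the real image, `parse_state_counts(recover_img, 15)` would stand in for the four `recover_data` lines above.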

Conclusion

The resulting data frame can be combined with the daily update of new cases and deaths by state. However, one challenge remains: WP Kuala Lumpur and WP Putrajaya are reported as a single combined figure in the image, so they cannot be separated.
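One way to handle the combined figure is to collapse WP Kuala Lumpur and WP Putrajaya in the other data set before merging. The sketch below assumes the new-case data has separate rows for the two territories. The `new_df` table and its numbers are placeholders for illustration, not real figures; only the recovered counts are taken from the example output above.

```r
# Recovered counts, values taken from the example output above
recovered = data.frame(
  date = as.Date("2020-11-24"),
  state = c("SELANGOR", "WP KUALA LUMPUR/PUTRAJAYA"),
  recover = c(465, 108),
  stringsAsFactors = FALSE
)

# HYPOTHETICAL new-case table: the numbers are placeholders, not real data
new_df = data.frame(
  date = as.Date("2020-11-24"),
  state = c("SELANGOR", "WP KUALA LUMPUR", "WP PUTRAJAYA"),
  new = c(100, 20, 5),
  stringsAsFactors = FALSE
)

# Relabel the two territories with the combined label used in the image
kl_pj = new_df$state %in% c("WP KUALA LUMPUR", "WP PUTRAJAYA")
new_df$state[kl_pj] = "WP KUALA LUMPUR/PUTRAJAYA"

# Sum rows that now share a label, then merge on date and state
new_df = aggregate(new ~ date + state, data = new_df, FUN = sum)
combined = merge(recovered, new_df, by = c("date", "state"))
combined
```

The trade-off is that the merged data set loses the separate Kuala Lumpur and Putrajaya new-case counts.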

This method makes it possible to fetch the number of recovered cases, although it is more involved than scraping a table. It can be easily integrated into any analysis workflow in R. Still, it is hoped that MOH will release the data in a more analysis-friendly format in the future.
