The Director General of Health Malaysia releases a daily press statement on the website at https://kpkesihatan.com/. However, the number of recovered cases by state is not given in table format, so the data cannot be scraped directly from the website in the way detailed in my gist.
Instead, the number of recovered cases is presented in image format, so I describe here how to use OCR in R to extract the daily number of recovered cases by state. The example presented here is based on the press statement of 24-11-2020.
You may also have a look at my scraped data sets and updated scripts at https://github.com/wnarifin/covid-19-malaysia.
library(rvest)
library(tesseract) # the Malay ("msa") language data must be installed, e.g. via tesseract_download("msa")
library(magick)
library(magrittr)
library(stringr)
# Date
my_date = Sys.Date() # the output shown below is from 2020-11-24
# my_date = "2020-11-01" # if you want another date, in yyyy-mm-dd format
my_day = format(as.Date(my_date), "%d")
my_day_no = as.numeric(my_day)
my_mo = format(as.Date(my_date), "%m")
my_mo_no = as.numeric(my_mo)
# Set URL (note that the year 2020 is hard-coded in the pattern below)
my_mo_list = c("januari", "februari", "mac", "april", "mei", "jun", "julai", "ogos", "september", "oktober", "november", "disember")
kpk_url = paste0("https://kpkesihatan.com/2020/", my_mo, "/", my_day, "/kenyataan-akhbar-kpk-", my_day_no,
"-", my_mo_list[my_mo_no], "-2020-situasi-semasa-jangkitan-penyakit-coronavirus-2019-covid-19-di-malaysia/")
kpk_page = read_html(kpk_url)
str(kpk_page) # make sure HTML page is loaded
## List of 2
## $ node:<externalptr>
## $ doc :<externalptr>
## - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
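The URL construction above can be wrapped in a small function so that any date can be passed in. The following is only a sketch: `build_kpk_url` is a hypothetical helper name, and it assumes that the URL pattern seen for 24-11-2020 also holds for other dates and years, which I have not verified.

```r
# Malay month names used in the press statement URLs (same list as above)
my_mo_list = c("januari", "februari", "mac", "april", "mei", "jun", "julai",
               "ogos", "september", "oktober", "november", "disember")

# Hypothetical helper: build the press statement URL for a given date.
# Assumes the URL pattern is the same for every date (not verified).
build_kpk_url = function(date) {
  date = as.Date(date)
  yr = format(date, "%Y") # four-digit year, e.g. "2020"
  mo = format(date, "%m") # zero-padded month, e.g. "11"
  dy = format(date, "%d") # zero-padded day, e.g. "04"
  paste0("https://kpkesihatan.com/", yr, "/", mo, "/", dy,
         "/kenyataan-akhbar-kpk-", as.numeric(dy), "-",
         my_mo_list[as.numeric(mo)], "-", yr,
         "-situasi-semasa-jangkitan-penyakit-coronavirus-2019-covid-19-di-malaysia/")
}

build_kpk_url("2020-11-24")
```

Taking the year from the date, instead of hard-coding it, keeps the path and the slug consistent with each other.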
img_node = html_nodes(kpk_page, "img") # all <img> tags on the page
img_loc = grep("discaj", img_node, ignore.case = TRUE) # get the node whose markup contains "discaj"
img_link = html_attr(img_node[img_loc], "data-orig-file") # get the content of the attribute in the tag
img_data = image_read(img_link) # read the image from the link
For Kelantan, the crop region is 80x22 pixels, with its upper-left corner at pixel location (200, 348). Change these values for other states.
state = "KELANTAN"
img_data_kelantan = img_data %>%
  image_scale("794x446") %>% # scale to a fixed size before cropping
  image_crop("80x22+200+348") %>% # crop the Kelantan region
  image_resize("2000x") # enlarge the image, works better with OCR
recover_data = image_ocr(img_data_kelantan, language = "msa") %>%
  str_extract_all("[:digit:]", simplify = T) %>% # keep the digits only
  str_c(collapse = "") %>% as.numeric()
data_recover_kelantan = data.frame(date=my_date, state=state, recover=recover_data)
data_recover_kelantan
## date state recover
## 1 2020-11-24 KELANTAN 4
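To repeat the crop for other states, it may help to build the geometry string and clean the OCR text with small helpers. This is a minimal sketch in base R: `crop_geometry` and `ocr_to_number` are hypothetical names, and only the Kelantan values come from the example above. The coordinates for other states would have to be measured from the image.

```r
# Hypothetical helper: build a "WIDTHxHEIGHT+X+Y" geometry string,
# the format passed to image_crop() in the example above
crop_geometry = function(width, height, x, y) {
  sprintf("%dx%d+%d+%d", as.integer(width), as.integer(height),
          as.integer(x), as.integer(y))
}

# Hypothetical helper: keep only the digits in the OCR text and convert
# them to a number; returns NA if no digit was recognised
ocr_to_number = function(txt) {
  digits = gsub("[^0-9]", "", txt)
  if (nchar(digits) == 0) NA_real_ else as.numeric(digits)
}

kelantan_geom = crop_geometry(80, 22, 200, 348) # Kelantan values from above
kelantan_geom
ocr_to_number("4 kes\n")
```

The resulting string can be passed as `image_crop(kelantan_geom)`. Returning `NA` when no digit is found makes a failed OCR read easy to spot.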
# Read image data
img_data_state = img_data %>% image_resize("2000x") %>% image_enhance() %>% image_modulate(brightness = 130)
# OCR
recover_img = image_ocr(img_data_state, language = "msa")
recover_data = str_split(recover_img, "[\n]", simplify = T) # split at \n
recover_data = recover_data[grep("kes", recover_data)] # keep the lines containing kes
recover_data = str_c(recover_data, collapse = " ") # join them into a single string
recover_data_state = as.numeric(str_split(recover_data, "kes", simplify = T)[1:15]) # the number before each kes
# Supposed to be 16, but WP KL & WP PUTRAJAYA are combined in the image
state_all = c("Perlis", "Kedah", "Pulau Pinang", "Perak", "Selangor",
"WP Kuala Lumpur/Putrajaya", "Negeri Sembilan", "Melaka", "Johor", "Pahang",
"Terengganu", "Kelantan", "Sabah", "Sarawak", "WP Labuan")
state_all = str_to_upper(state_all)
data_recover_state = data.frame(date=rep(my_date,length(recover_data_state)), state=state_all, recover=recover_data_state)
data_recover_state
## date state recover
## 1 2020-11-24 PERLIS 1
## 2 2020-11-24 KEDAH 45
## 3 2020-11-24 PULAU PINANG 34
## 4 2020-11-24 PERAK 7
## 5 2020-11-24 SELANGOR 465
## 6 2020-11-24 WP KUALA LUMPUR/PUTRAJAYA 108
## 7 2020-11-24 NEGERI SEMBILAN 143
## 8 2020-11-24 MELAKA 4
## 9 2020-11-24 JOHOR 10
## 10 2020-11-24 PAHANG 0
## 11 2020-11-24 TERENGGANU 4
## 12 2020-11-24 KELANTAN 4
## 13 2020-11-24 SABAH 833
## 14 2020-11-24 SARAWAK 10
## 15 2020-11-24 WP LABUAN 5
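OCR output can change from day to day, so it may be worth checking the number of parsed values before building the data frame. The sketch below is a base R variant of the parsing step, not the exact code above. `parse_recovered` is a hypothetical name, and the text in `ocr_text` is a made-up stand-in patterned on the first three rows of the output above, since the real OCR layout may differ.

```r
# Made-up stand-in for OCR output, for illustration only
ocr_text = "PERLIS\n1 kes\nKEDAH\n45 kes\nPULAU PINANG\n34 kes\n"

# Hypothetical helper: parse the counts and stop if the result is not as expected
parse_recovered = function(ocr_text, n_expected) {
  lines = strsplit(ocr_text, "\n", fixed = TRUE)[[1]] # split at \n
  lines = lines[grepl("kes", lines, fixed = TRUE)]    # keep the lines containing kes
  counts = as.numeric(gsub("[^0-9]", "", lines))      # keep the digits only
  if (length(counts) != n_expected || anyNA(counts)) {
    stop("Expected ", n_expected, " counts, parsed ", length(counts),
         " (", sum(is.na(counts)), " unreadable)")
  }
  counts
}

parse_recovered(ocr_text, n_expected = 3)
```

For the full image, `n_expected` would be 15, matching the number of rows in the table above.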
The resulting data frame can be combined with the daily update of new cases and deaths by state. However, a challenge remains because WP Kuala Lumpur and WP Putrajaya are reported as a single combined figure in the image.
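One possible way to combine the data sets is to sum the two federal territories in the new cases data so that the state labels match before merging. This is only a sketch under assumptions: the `new_cases` column and its values are placeholders, not real figures, and I assume the new cases data lists WP Kuala Lumpur and WP Putrajaya separately. The recovered counts are taken from the output above.

```r
# Placeholder new cases data (NOT real figures), with the two WPs listed separately
data_new = data.frame(date = as.Date("2020-11-24"),
                      state = c("PERLIS", "WP KUALA LUMPUR", "WP PUTRAJAYA"),
                      new_cases = c(10, 20, 30),
                      stringsAsFactors = FALSE)

# Relabel both WPs with the combined label, then sum within each date and state
kl_pj = c("WP KUALA LUMPUR", "WP PUTRAJAYA")
data_new$state[data_new$state %in% kl_pj] = "WP KUALA LUMPUR/PUTRAJAYA"
data_new = aggregate(new_cases ~ date + state, data = data_new, FUN = sum)

# Recovered counts for the same states, from the output above
data_recover = data.frame(date = as.Date("2020-11-24"),
                          state = c("PERLIS", "WP KUALA LUMPUR/PUTRAJAYA"),
                          recover = c(1, 108),
                          stringsAsFactors = FALSE)

# Merge by date and state; all = TRUE keeps rows that appear in only one data set
data_all = merge(data_new, data_recover, by = c("date", "state"), all = TRUE)
data_all
```

Summing loses the split between the two territories, but it avoids guessing how the combined recovered count should be divided.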
This method makes it possible to fetch the number of recovered cases, although the process is more involved than scraping a table. It can be easily integrated into any analysis workflow in R. Still, it is hoped that MOH will release the data in a more analysis-friendly format in the future.