@stevecondylios
Last active November 26, 2021 21:02

Here we find world population by timezone (from an extremely reliable source: an online quiz).

First, we can try using rvest to grab the results. The problem is that the UTC offsets only appear on the page once the quiz has actually been played.

i.e.

library(tidyverse)
library(rvest)

page <- read_html("https://www.sporcle.com/games/segacs/time-zones-in-order-of-population")

df <- page %>% 
  html_nodes("table") %>% 
  .[2] %>% 
  html_table %>% 
  .[[1]]

df %>% 
  head

#   Population (est.) Time Zone (UTC offset)                                                   Notable Includes
# 1     1,578,332,733                     NA China, Malaysia, Central Indonesia, Philippines, Western Australia
# 2     1,313,334,597                     NA                                                   India, Sri Lanka
# 3       779,333,045                     NA                                     Central Europe, Western Africa
# 4       585,168,798                     NA                               European Russia, Arabia, East Africa
# 5       445,080,472                     NA           Eastern Europe, Middle East, Central and Southern Africa
# 6       398,538,888                     NA                                     Southeast Asia, West Indonesia

But we can manually visit the page, start the quiz, and select 'give up', which loads the answers (the UTC offsets). Then, in the browser, select File -> Save page as. I saved it in my Downloads folder as tz.html

updated_html <- read_html("Downloads/tz.html")

updated_df <- updated_html %>% 
  html_nodes("table") %>% 
  .[2] %>% 
  html_table %>% 
  .[[1]]

updated_df %>% head
#  Population (est.) Time Zone (UTC offset)                                                   Notable Includes
# 1     1,578,332,733             UTC +08:00 China, Malaysia, Central Indonesia, Philippines, Western Australia
# 2     1,313,334,597             UTC +05:30                                                   India, Sri Lanka
# 3       779,333,045             UTC +01:00                                     Central Europe, Western Africa
# 4       585,168,798             UTC +03:00                               European Russia, Arabia, East Africa
# 5       445,080,472             UTC +02:00           Eastern Europe, Middle East, Central and Southern Africa
# 6       398,538,888             UTC +07:00                                     Southeast Asia, West Indonesia
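
Picking the table by position (`.[2]`) works here but depends on the page layout. As a hedged alternative, the sketch below selects the table by one of its column names instead. It runs on a tiny made-up HTML snippet built with `minimal_html()` rather than the real page, and the names `tables`, `has_offset_col` and `quiz_table` are illustrative only.

```r
library(rvest)

# Made-up stand-in for the saved page: two tables, only the second has the offsets
page <- minimal_html('
  <table><tr><th>Rank</th></tr><tr><td>1</td></tr></table>
  <table>
    <tr><th>Population (est.)</th><th>Time Zone (UTC offset)</th></tr>
    <tr><td>1,578,332,733</td><td>UTC +08:00</td></tr>
  </table>')

# Parse every table on the page into a list of data frames
tables <- html_table(html_elements(page, "table"))

# Keep the first table that has the column we care about
has_offset_col <- vapply(
  tables,
  function(tbl) "Time Zone (UTC offset)" %in% names(tbl),
  logical(1)
)
quiz_table <- tables[[which(has_offset_col)[1]]]
```

With the real file, `page` would come from `read_html("Downloads/tz.html")` as above; the rest should be unchanged, assuming the column header on the saved page matches.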

(out <- updated_df %>% 
  rename(
    utc_offset = 'Time Zone (UTC offset)',
    population = 'Population (est.)',
    includes = 'Notable Includes') %>% 
  mutate(
    offset_minutes = substr(utc_offset, nchar(utc_offset)-1, nchar(utc_offset)),
    utc_offset_is_an_hour_multiple_of_one = offset_minutes == "00",
    population = str_remove_all(population, ",") %>% as.numeric
    ) %>% 
  group_by(utc_offset_is_an_hour_multiple_of_one) %>% 
  summarise(total_pop = sum(population)))

# # A tibble: 2 x 2
#   utc_offset_is_an_hour_multiple_of_one  total_pop
#   <lgl>                                      <dbl>
# 1 FALSE                                 1537594167
# 2 TRUE                                  5782896345

1537594167 / (1537594167 + 5782896345)
# [1] 0.2100398

So the answer is that about 21% of the world's population lives in a time zone whose UTC offset is not a whole number of hours.
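
The same idea can be taken one step further by splitting the population by the minutes part of the offset (00, 30 or 45). This is only a sketch in base R on four rows copied from the quiz data, so the resulting shares are not the worldwide figures; `sample_df`, `pop_by_minutes` and `share_by_minutes` are names made up for this example.

```r
# Four rows copied from the quiz data (a sample, not the full data set)
sample_df <- data.frame(
  population = c(1578332733, 1313334597, 79620200, 30986975),
  utc_offset = c("UTC +08:00", "UTC +05:30", "UTC +03:30", "UTC +05:45")
)

# Last two characters of the offset are the minutes, as in the pipeline above
n <- nchar(sample_df$utc_offset)
sample_df$offset_minutes <- substr(sample_df$utc_offset, n - 1, n)

# Total population per minutes value, then each group's share of the sample
pop_by_minutes <- tapply(sample_df$population, sample_df$offset_minutes, sum)
share_by_minutes <- pop_by_minutes / sum(pop_by_minutes)
```

Running the same three steps on the full data frame would show how much of the 21% comes from half-hour offsets and how much from 45-minute ones.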

Data for future reference

Here's an easy way to get a cleaned-up data.frame in R in the future. Simply copy the following code into R:

df <- structure(list(population = c(1578332733, 1313334597, 779333045, 
585168798, 445080472, 398538888, 301108339, 288163120, 252535680, 
241618458, 240604308, 208231089, 187360619, 99051407, 79620200, 
61526901, 54364022, 40307463, 32872730, 32564342, 31400559, 30986975, 
24213510, 6157399, 1971761, 1919028, 1713232, 773395, 731910, 
528336, 299586, 55593, 9264, 8809, 2784, 600, 360, 200, 0), utc_offset = c("UTC +08:00", 
"UTC +05:30", "UTC +01:00", "UTC +03:00", "UTC +02:00", "UTC +07:00", 
"UTC -05:00", "UTC +05:00", "UTC ±00:00", "UTC -03:00", "UTC -06:00", 
"UTC +06:00", "UTC +09:00", "UTC -04:00", "UTC +03:30", "UTC -08:00", 
"UTC +06:30", "UTC +04:00", "UTC -07:00", "UTC +04:30", "UTC +10:00", 
"UTC +05:45", "UTC +08:30", "UTC +12:00", "UTC +09:30", "UTC +11:00", 
"UTC -10:00", "UTC -01:00", "UTC -09:00", "UTC -03:30", "UTC +13:00", 
"UTC -11:00", "UTC -09:30", "UTC +14:00", "UTC -02:00", "UTC +12:45", 
"UTC +10:30", "UTC +08:45", "UTC -12:00"), includes = c("China, Malaysia, Central Indonesia, Philippines, Western Australia", 
"India, Sri Lanka", "Central Europe, Western Africa", "European Russia, Arabia, East Africa", 
"Eastern Europe, Middle East, Central and Southern Africa", "Southeast Asia, West Indonesia", 
"Eastern Time (US/Can), Cuba, Haiti, South America", "Pakistan, Central Asia, Maldives", 
"Coordinated Universal Time - UK, West Africa", "Argentina, Brazil, Falklands, Uruguay, Greenland", 
"Central Time (US/Can/Mex), Central America, Galapagos, Easter Island", 
"Bangladesh, Bhutan, Omsk, East Kazakhstan, Xinjiang (unofficial)", 
"Japan, South Korea", "Atlantic Time (Can), Caribbean, Venezuela, Bolivia, Guyana, Amazonas (Brazil)", 
"Iran", "Pacific Time (US/Can), Pitcairn Islands", "Myanmar, Cocos Islands", 
"Samara, Caucasus, Gulf", "Mountain Time (US/Can/Mex)", "Afghanistan", 
"Eastern Australia, Papua New Guinea, Primorsky", "Nepal", "North Korea", 
"New Zealand, Fiji", "Central Australia (NT, SA)", "New Caledonia, Solomon Islands, Vanuatu", 
"French Polynesia, Hawaii, Cook Islands", "Cabo Verde, Azores, Eastern Greenland", 
"Alaska, Gambier Islands", "Newfoundland, Southeastern Labrador", 
"Samoa, Tonga", "American Samoa, Niue", "Marquesas Islands (French Polynesia)", 
"Line Islands (Kiribati)", "Fernando de Noronha, South Georgia and South Sandwich Islands", 
"Chatham Islands (New Zealand)", "Lord Howe Island (Australia)", 
"Eucla (Australia) - unofficial", "Howland Island, Baker Island (USA) - uninhabited"
)), class = "data.frame", row.names = c(NA, -39L))

df %>% head
#   population utc_offset                                                           includes
# 1 1578332733 UTC +08:00 China, Malaysia, Central Indonesia, Philippines, Western Australia
# 2 1313334597 UTC +05:30                                                   India, Sri Lanka
# 3  779333045 UTC +01:00                                     Central Europe, Western Africa
# 4  585168798 UTC +03:00                               European Russia, Arabia, East Africa
# 5  445080472 UTC +02:00           Eastern Europe, Middle East, Central and Southern Africa
# 6  398538888 UTC +07:00                                     Southeast Asia, West Indonesia
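
For later analysis it may be handier to have the offsets as numbers rather than strings. The helper below is a hypothetical addition, not part of the original analysis: `offset_to_hours()` is a made-up name, and it assumes every string ends in `HH:MM` and that negative offsets use a plain ASCII hyphen, as they appear to in the data above.

```r
# Hypothetical helper: converts strings such as "UTC +05:30" into signed decimal hours
offset_to_hours <- function(x) {
  # Grab the trailing "HH:MM" part of each string
  hhmm <- sub(".*([0-9]{2}:[0-9]{2})$", "\\1", x)
  hours <- as.numeric(substr(hhmm, 1, 2))
  minutes <- as.numeric(substr(hhmm, 4, 5))
  # Assumption: negative offsets contain an ASCII "-"; "+" and the plus-minus sign count as non-negative
  sign <- ifelse(grepl("-", x, fixed = TRUE), -1, 1)
  sign * (hours + minutes / 60)
}

# "\u00b1" is the plus-minus sign used for UTC itself in the data
offsets <- c("UTC +05:30", "UTC -03:30", "UTC \u00b100:00", "UTC +12:45")
hours <- offset_to_hours(offsets)

# TRUE where the offset is not a whole number of hours
not_whole_hour <- hours %% 1 != 0
```

Applied to the data frame above, this would be `df$offset_hours <- offset_to_hours(df$utc_offset)`.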