@benmarwick
Last active April 7, 2023 17:17
Scraping academic jobs on wikia.com
library(tidyverse)
base_url <- "http://academicjobs.wikia.com/wiki/Archaeology_Jobs_"
# starts at 2010-2011
years <- map_chr(2010:2019, ~str_glue('{.x}-{.x +1}'))
# though it seems to start at 2007-8: https://academicjobs.fandom.com/wiki/Archaeology_07-08
urls_for_each_year <- str_glue('{base_url}{years}')
library(rvest)
#------------------------------------------
# 2010-2011 has no table
urls_for_each_year[1] %>%
  read_html() %>%
  # html_nodes('.mw-content-text') %>%
  html_nodes('.mw-headline') %>%
  html_text()
#------------------------------------------
# table first appears in 2011-2012
urls_for_each_year[2] %>%
  read_html() %>%
  html_node('table , td') %>%
  html_table()
# but headings are not systematic
urls_for_each_year[2] %>%
  read_html() %>%
  # html_nodes('.mw-content-text') %>%
  html_nodes('.mw-headline') %>%
  html_text()
#------------------------------------------
# table is also present in 2012-2013
urls_for_each_year[3] %>%
  read_html() %>%
  html_node('table , td') %>%
  html_table()
# headings for 2012-2013
urls_for_each_year[3] %>%
  read_html() %>%
  # html_nodes('.mw-content-text') %>%
  html_nodes('.mw-headline') %>%
  html_text()
#-------------------------------------
# all years
urls_for_each_year_headers <-
  map(urls_for_each_year,
      ~.x %>%
        read_html() %>%
        html_nodes('.mw-headline') %>%
        html_text())
# what are the different sections?
# "TENURE-TRACK POSITIONS"
# "TENURE-TRACK OR TENURED / FULL-TIME POSITIONS"
# "Tenure-Track or Tenured / Full-time Position "
# "ASSISTANT PROFESSOR OR OPEN RANK"
# "TENURE TRACK ASSISTANT PROFESSOR OR OPEN RANK"
# "TENURED ASSOCIATE OR FULL PROFESSOR"
# "ASSOCIATE OR FULL PROFESSOR"
# "NON-TENURE-TRACK POSITIONS"
# "VISITING POSITIONS / Limited-Term Appointments / Postdocs"
# "VISITING POSITIONS / LIMITED TERM APPOINTMENTS / POSTDOCS"
# "VISITING POSITIONS / LIMITED-TERM APPOINTMENTS / POSTDOCS / PART-TIME POSITIONS"
# "VISITING POSITIONS"
# "COMPLETED SEARCHES"
# "DISCUSSION, RUMORS AND SPECULATION"
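
Because the section headings vary from year to year, one possible next step is to collapse the variants listed above into a few consistent classes before comparing years. The following is a minimal sketch of that idea, not part of the original script: the function name `classify_heading`, the class labels, and the regular expressions are all assumptions chosen for illustration, and the heading strings are copied from the list above.

```r
library(dplyr)
library(stringr)

# Heading variants copied from the list above
headings <- c(
  "TENURE-TRACK POSITIONS",
  "TENURE-TRACK OR TENURED / FULL-TIME POSITIONS",
  "Tenure-Track or Tenured / Full-time Position ",
  "ASSISTANT PROFESSOR OR OPEN RANK",
  "TENURE TRACK ASSISTANT PROFESSOR OR OPEN RANK",
  "TENURED ASSOCIATE OR FULL PROFESSOR",
  "ASSOCIATE OR FULL PROFESSOR",
  "NON-TENURE-TRACK POSITIONS",
  "VISITING POSITIONS / Limited-Term Appointments / Postdocs",
  "VISITING POSITIONS / LIMITED TERM APPOINTMENTS / POSTDOCS",
  "VISITING POSITIONS / LIMITED-TERM APPOINTMENTS / POSTDOCS / PART-TIME POSITIONS",
  "VISITING POSITIONS",
  "COMPLETED SEARCHES",
  "DISCUSSION, RUMORS AND SPECULATION"
)

# Hypothetical helper: map a raw heading to a coarse class.
# The class labels and patterns are assumptions, not from the wiki.
classify_heading <- function(x) {
  # normalise case and whitespace so variants compare equal
  x <- str_to_upper(str_squish(x))
  case_when(
    # checked first so "NON-TENURE-TRACK" is not caught by "TENURE"
    str_detect(x, "^NON-TENURE|VISITING|POSTDOC|LIMITED") ~ "non_tenure_track",
    str_detect(x, "TENURE|PROFESSOR|OPEN RANK") ~ "tenure_track_or_tenured",
    str_detect(x, "COMPLETED") ~ "completed_searches",
    str_detect(x, "DISCUSSION|RUMOR") ~ "discussion",
    TRUE ~ "other"
  )
}

heading_classes <- tibble(
  heading = headings,
  class   = classify_heading(headings)
)
```

The same function could then be mapped over each element of `urls_for_each_year_headers`; any heading that falls into `"other"` would flag a variant not yet covered by the patterns.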
@benmarwick
Author

Relevant literature:

Karl, R., Möller, K. and Krierer, K. 2012. Ain’t got no job. The archaeology labour market in Austria, Germany and the UK, 2007-2012. Vienna: http://www.archaeologieforum.at https://www.academia.edu/9367180/Ain_t_got_no_job_The_archaeology_labour_market_in_Austria_Germany_and_the_UK_2007_2012_Vienna_I%C3%96AF_2012

Boothby, C., Milojević, S. An exploratory full-text analysis of Science Careers in a changing academic job market. Scientometrics 126, 4055–4071 (2021). https://doi.org/10.1007/s11192-021-03905-2

Rachael Pitt & Inger Mewburn (2016) Academic superheroes? A critical analysis of academic job descriptions, Journal of Higher Education Policy and Management, 38:1, 88-101, DOI: 10.1080/1360080X.2015.1126896

Leon, L. A., Seal, K. C., Przasnyski, Z. H., & Wiedenman, I. (2018). Skills and competencies required for jobs in business analytics: A content analysis of job advertisements using text mining. In Operations and Service Management: Concepts, Methodologies, Tools, and Applications (pp. 880-904). IGI Global.

Kortum, H., Rebstadt, J., & Thomas, O. (2022, January). Dissection of AI Job Advertisements: A Text Mining-based Analysis of Employee Skills in the Disciplines Computer Vision and Natural Language Processing. In Proceedings of the 55th Hawaii International Conference on System Sciences.
