---
title: "Why do we observe a high peak in the number of hits for 'kierkegaard' in 2013?"
author: "Toke Eskildsen and Per Møldrup-Dalum"
output:
  html_document:
    df_print: paged
---

```{r libraries, message=FALSE, echo=FALSE, warning=FALSE}
library(tidyverse)
library(dplyr)
library(solrium)
library(jsonlite)
library(formattable)
library(scales)
library(knitr)
library(kableExtra)
library(readr)
library(lubridate)
library(here)
```

# Background

In the initial phase of the DeiC National Pilot Project [N.F.S. Grundtvig i danske medier](https://kulturarvscluster.kb.dk/projekter/n-f-s-grundtvigs-i-danske-medier), we have been discussing the design of the project, and thereby which material from The Royal Danish Library to use for the quantitative analyses. One of the cultural heritage collections will be [The Danish Netarchive](http://netarkivet.dk/).

The project aims to explore Grundtvig's influence on Danish culture by looking at how his name appears, in terms of, but not limited to, frequency, semantics, and graph networks.

To get to know the data, we decided to use Kierkegaard as a comparison to Grundtvig.

This text describes the discovery of an anomaly in the frequency of Kierkegaard hits: we observed far more documents containing the term 'kierkegaard' than our intuition would expect. We also try to explain this anomaly, discuss how it might affect studies using the Net Archive, and outline some paths going forward.

# Prepare the data

We have counted all text documents in the Net Archive that contain either 'kierkegaard' or 'grundtvig'. To normalise the results, we also counted every text document. All counts are grouped by the month in which the documents were harvested.

```{r, echo=FALSE, warning=FALSE, message=FALSE}
baseline_data <- read_csv(
  here("data/baseline_month.csv"),
  skip = 1,
  col_names = c("year", "month", "harvested"))

grundtvig_data <- read_csv(
  here("data/grundtvig_month.csv"),
  skip = 1,
  col_names = c("grundtvig", "month", "year"))

kierkegaard_data <- read_csv(
  here("data/kierkegaard_month.csv"),
  skip = 1,
  col_names = c("kierkegaard", "month", "year"))
```

To remove the effect of fluctuations in the harvest volume, we use the number of hits relative to the total number of documents harvested in a given month, i.e. the percentage of documents containing the search term.

```{r, echo=FALSE, message=FALSE, warning=FALSE}
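# Join the three monthly counts on year and month. A month missing from one of
# the hit files yields NA after the join and is treated as 0 hits.
full_join(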
    full_join(baseline_data, grundtvig_data, by = c("year", "month")),
    kierkegaard_data,
    by = c("year", "month")
  ) %>%
    mutate(
      grundtvig = if_else(is.na(grundtvig), 0, grundtvig),
      kierkegaard = if_else(is.na(kierkegaard), 0, kierkegaard)
    ) -> data
```

```{r echo=FALSE, message=FALSE, warning=FALSE}
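# month_number counts months from January 1990 (index 0);
# index 150 is July 2002, so the filter keeps August 2002 and later.
data %>%
  mutate(month_number = 12 * (year - 1990) + (month - 1)) %>%
  filter(month_number > 150) %>%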
  mutate(
    Grundtvig = 100*grundtvig/harvested,
    Kierkegaard = 100*kierkegaard/harvested
    ) %>%
  select(month_number, Grundtvig, Kierkegaard) %>%
  pivot_longer(!month_number, names_to = "who", values_to = "freq") -> data_freq
```

If we visualise the relative counts as a function of the month they were harvested, we observe the mentioned anomaly.

```{r, echo=FALSE}
data_freq %>%
  ggplot() +
    geom_line(aes(x = month_number, y = freq, color = who)) +
    ggtitle(
      "Grundtvig versus Kierkegaard",
      subtitle = "Number of documents containing 'grundtvig'/'kierkegaard' compared\nto the number of harvested documents"
    ) +
    labs(
      caption = "Source: Netarkivet, 2020",
      x = "Month since January 1990",
      y = "Percent of total number of harvested documents",
      color = NULL)
```
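
To read the x-axis in calendar terms, the month index can be mapped back to a date. The following is a minimal sketch in base R; the helper name `month_number_to_date` is hypothetical and not part of the original analysis. It simply inverts the formula `12 * (year - 1990) + (month - 1)` used when preparing the data.

```r
# Hypothetical helper: invert month_number = 12 * (year - 1990) + (month - 1).
# Index 0 corresponds to January 1990.
month_number_to_date <- function(month_number) {
  year  <- 1990 + month_number %/% 12   # whole years since 1990
  month <- month_number %% 12 + 1       # remaining months, shifted to 1..12
  as.Date(sprintf("%d-%02d-01", year, month))
}

# First month of the index, the filter threshold, and August 2013.
month_number_to_date(c(0, 150, 283))
```

With this mapping, August 2013 corresponds to index 283.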

Is it possible that 2% of all documents harvested in August 2013 mentioned Kierkegaard? It was 200 years since his birth (May 5th, 1813), which is probably part of the explanation, but it would be extreme if the anniversary accounted for the entire observation.

Digging into the anomaly by searching the data, we discovered that very few domains accounted for most of the hits. One of these domains was a Danish newspaper that ran a topic on Søren Kierkegaard throughout 2013. The topic was implemented as a link in a drop-down menu on every web page, labelled with Kierkegaard's name ('Kierkegaard - 200 år'). Every document harvested from that newspaper that year therefore appears to mention Kierkegaard. On top of that, the newspaper was, and still is, harvested with a very high frequency, which amplifies the effect. We confirmed the use of Kierkegaard in the menus by visual inspection, as can be seen in the screenshot below. An interactive example is available at The Internet Archive: [JP, 25 August 2013](https://web.archive.org/web/20130825140412/http://jyllands-posten.dk/).

![Example of Kierkegaard being part of a web page on an unrelated topic](images/kierkegaard-off-topic.png)

So, to answer the question from above: yes, 2% of the documents from August 2013 in the archive did indeed contain the word Kierkegaard, just not in a semantically valuable form.

At the moment, we have no known methods implemented to distinguish between 'kierkegaard' appearing in menus and 'kierkegaard' appearing in the actual content of the web page.
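
The observation that very few domains account for most of the hits can be quantified by computing each domain's share of the hits. The following is a minimal sketch in base R, assuming a table of per-domain hit counts for a given month is available. The domain names and counts are invented for illustration and do not come from the Net Archive, and the 90% cut-off is an arbitrary choice.

```r
# Hypothetical per-domain hit counts for one month (made-up values).
hits <- data.frame(
  domain = c("newspaper.example", "blog.example", "library.example", "shop.example"),
  hits   = c(9400, 300, 200, 100),
  stringsAsFactors = FALSE
)

# Sort by hit count, then compute each domain's share and the cumulative share.
hits <- hits[order(-hits$hits), ]
hits$share     <- hits$hits / sum(hits$hits)
hits$cum_share <- cumsum(hits$share)

# Smallest set of top domains that together cover at least 90% of the hits.
n_needed    <- which(hits$cum_share >= 0.9)[1]
top_domains <- hits$domain[seq_len(n_needed)]
top_domains
```

If a single domain, or a handful of domains, covers nearly all hits in a month, that is a hint that the count reflects page design or harvest frequency rather than content.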

# Going forward

This is an example of how the technical design of a web page can completely overshadow the actual content that one tries to analyse. We have long suspected this, but to our knowledge this is one of the first examples of it actually skewing the results. We were lucky that the skew was huge and easily observable.

Even though we have no methods implemented to remedy this, we do have a few ideas:

-   For the specific newspaper, we could remove all instances of the text "Kierkegaard - 200 år" from the results, as that specific wording is used in the menu of at least one newspaper. Still, that would only handle the skew for that specific newspaper.

-   We could identify all websites with an unreasonably high count of 'kierkegaard' and exclude those websites entirely from the result. This would, of course, introduce other kinds of skew.

-   We could look only at the broad harvests, as done by the [Probing a Nation's Web Domain](https://kulturarvscluster.kb.dk/projekter/p002) project. As above, this introduces other forms of skew.

-   We could use an advanced re-rendering of the source HTML and try to distinguish between design and content elements. As there is no standard way of building web pages, this would have to be implemented per domain, media house, or web publisher.

-   Instead of re-rendering the complete page, we could use existing tools for just extracting the visible parts of a webpage from the HTML code.

-   We could come up with a heuristic identifying when Kierkegaard is used in a sentence and not in design elements. This method could be a more general solution, as it is based on linguistics and not web design or technicalities.
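
Two of the text-level ideas above, removing the known menu label and applying a sentence heuristic, can be sketched in base R. This is only an illustration under stated assumptions: the function names are hypothetical, the example documents are made up, and the rule "at least five words and ending in sentence punctuation" is an arbitrary stand-in for a real linguistic heuristic.

```r
# Idea: drop the known menu label before searching for the term.
strip_menu_label <- function(text, label = "Kierkegaard - 200 år") {
  gsub(label, "", text, fixed = TRUE)
}

# Naive check corresponding to a plain search hit.
mentions_term <- function(text, term = "kierkegaard") {
  grepl(term, text, ignore.case = TRUE)
}

# Idea: count a hit only if the term occurs in a sentence-like fragment,
# here defined (arbitrarily) as >= min_words words ending in . ! or ?
in_sentence <- function(text, term = "kierkegaard", min_words = 5) {
  # Break after sentence-ending punctuation, then split on line breaks.
  fragments <- unlist(strsplit(gsub("([.!?])\\s+", "\\1\n", text), "\n+"))
  fragments <- trimws(fragments)
  has_term  <- grepl(term, fragments, ignore.case = TRUE)
  n_words   <- lengths(strsplit(fragments, "\\s+"))
  ends_ok   <- grepl("[.!?]$", fragments)
  any(has_term & n_words >= min_words & ends_ok)
}

# Made-up documents: one with the term only in a menu, one with real content.
docs <- c(
  menu_only = "Nyheder\nSport\nKierkegaard - 200 år\nKultur\nThe weather will be sunny tomorrow.",
  article   = "Søren Kierkegaard was born in Copenhagen in 1813. He wrote many books."
)

naive     <- mentions_term(docs)
stripped  <- mentions_term(strip_menu_label(docs))
heuristic <- vapply(docs, in_sentence, logical(1))
```

In this toy example the naive check flags both documents, while both the label removal and the sentence heuristic keep only the article. The label removal is specific to one newspaper, whereas the heuristic does not depend on knowing the menu text in advance.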

It is important to realize that without some sort of processing, this text data cannot be used for topic modelling in any form.
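
As one illustration of such processing, navigation elements can be removed from the HTML before the text is extracted. The sketch below uses the `xml2` package (installed as part of the tidyverse). It assumes that menus are marked up with elements such as `<nav>`, which is a strong assumption: pages from 2013 may well use plain `<div>` or `<ul>` elements instead, so this would not work unchanged across publishers. The HTML snippet and the function name `extract_content_text` are made up for the example.

```r
library(xml2)

# Made-up page: the menu sits in <nav>, the article text in <main>.
html <- '<html><body>
  <nav><ul><li><a href="/kierkegaard">Kierkegaard - 200 år</a></li></ul></nav>
  <main><p>The weather will be sunny tomorrow.</p></main>
</body></html>'

# Hypothetical helper: drop typical non-content elements, return body text.
extract_content_text <- function(html) {
  doc <- read_html(html)
  # Remove elements that usually hold navigation or non-visible material.
  xml_remove(xml_find_all(doc, "//nav | //header | //footer | //script | //style"))
  trimws(xml_text(xml_find_first(doc, "//body")))
}

content <- extract_content_text(html)
has_term <- grepl("kierkegaard", content, ignore.case = TRUE)
```

Here the term only occurs in the menu, so it no longer appears in the extracted text.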