@benmarwick
Created June 1, 2021 16:51
Object recognition in Images in #viralarchive tweets
I used the Python library GetOldTweets3 to get the tweets because the rtweet package cannot get tweets older than 6-9 days. Details about this Python library are here: https://github.com/Mottl/GetOldTweets3
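For comparison, a standard rtweet call would look roughly like this sketch (not run here, since rtweet's standard search only reaches back about a week):
```{r, eval = F}
# not run: rtweet's search_tweets() only returns tweets from roughly the last
# 6-9 days, which is why GetOldTweets3 was used instead
library(rtweet)
recent_va <- search_tweets("#viralarchive", n = 10000, include_rts = FALSE)
```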
I used this line in the shell to get tweets using the #viralarchive hashtag:
```{bash, engine.opts="-l", eval = F}
GetOldTweets3 --querysearch 'viralarchive' --maxtweets 10000
```
The Python function saves the tweets into a CSV file. We can read it into R like this:
```{r}
library(tidyverse)
library(lubridate)
va <- readr::read_csv("output_got.csv") %>%
  filter(ymd_hms(date) > "2019-01-01")
```
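As a quick sanity check, we can look at the two columns the rest of the workflow relies on, `date` and `permalink` (column names as written by GetOldTweets3):
```{r}
# peek at the columns used later for filtering and screenshotting
va %>%
  select(date, permalink) %>%
  head()
```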
Here is a time series plot of tweets per day:
```{r}
va %>%
  mutate(day = round_date(ymd_hms(date), "day")) %>%
  group_by(day) %>%
  summarise(n_tweets = n()) %>%
  drop_na() %>%
  ggplot() +
  aes(day, n_tweets) +
  geom_line()
```
Here we try to get the photos from each tweet that has them. This is not easy because the Python tool doesn't give us a specific column of media attached to the tweet, and Twitter makes it very difficult to scrape content automatically.
Here we pretend to be an iPhone, because the tweet layout is very simple on that device, and then take a screenshot of each tweet's permalink page:
```{r}
library(rvest)
library(httr)
# the user-agent string is passed to webshot() below via its useragent argument;
# calling httr::user_agent() on its own would have no effect here
ua <- 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_2 like Mac OS X; en-us)'

url <- va$permalink

library(webshot)
webshot(url,
        file = paste0("images/", basename(url), ".jpg"),
        delay = 0.1,
        # cliprect is c(top, left, width, height)
        # cliprect = c(150, 200, 600, 600),
        useragent = ua)
```
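Note that webshot renders pages with PhantomJS; if it is not already installed, a one-time setup step like this is needed first:
```{r, eval = F}
# one-time setup: webshot uses PhantomJS to render the pages it screenshots
webshot::install_phantomjs()
```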
Try to crop the screenshots to focus on the image. We look for the location of the #viralarchive text near the top, and the 'Twitter' text at the bottom. It's not very good, but it's better than nothing, maybe.
```{r}
library(tesseract)
eng <- tesseract("eng")
imgs <- list.files("images", full.names = TRUE)
for(img in imgs){
  tryCatch({

    # OCR the screenshot so we can find landmark words to crop between
    text_ocr <- tesseract::ocr_data(img, engine = eng)

    # bounding box of the word 'Twitter' near the bottom of the screenshot
    bbox_tw <-
      text_ocr %>%
      mutate(word = tolower(str_replace_all(word, "[^a-zA-Z]+", ""))) %>%
      filter(word == "twitter") %>%
      pull(bbox) %>%
      str_split(",") %>%
      unlist() %>%
      as.numeric()

    # bounding box of the word 'viralarchive' near the top of the screenshot
    bbox_va <-
      text_ocr %>%
      mutate(word = tolower(str_replace_all(word, "[^a-zA-Z]+", ""))) %>%
      filter(word == "viralarchive") %>%
      pull(bbox) %>%
      str_split(",") %>%
      unlist() %>%
      as.numeric()

    library(magick)
    imgm <- magick::image_read(img)
    img_w <- image_info(imgm)$width
    img_h <- image_info(imgm)$height

    img_h_to_tw <- bbox_tw[4] - img_h * 0.12
    img_h_to_va <- bbox_va[4] + 25

    # geometry is widthxheight+x+y: the size of the image that remains after
    # cropping, and x and y in the offset (if present) give the location of the
    # top left corner of the cropped image with respect to the original image
    img_c <-
      image_crop(imgm,
                 paste0(img_w * 0.5, "x",
                        img_h_to_tw,
                        "+",
                        img_w * 0.25,
                        "+",
                        img_h_to_va))

    image_write(img_c,
                path = paste0("images-cropped/", basename(img)),
                format = NULL,
                quality = NULL,
                depth = NULL,
                density = NULL,
                comment = NULL,
                flatten = FALSE)

  }, error = function(e){cat("ERROR :", conditionMessage(e), "\n")})
}
```
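As an optional spot check (not part of the original workflow), we can confirm that at least one cropped file was written and look at its dimensions:
```{r}
# confirm some crops were written and inspect the first one
cropped <- list.files("images-cropped", full.names = TRUE)
length(cropped)
if (length(cropped) > 0) print(magick::image_info(magick::image_read(cropped[1])))
```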
Do object recognition on the cropped photos:
```{r}
library(keras)
model <- application_xception(weights = "imagenet")
# I got an error the first time, then downloaded the model file and
# put this in ~/.keras/models : "inception_v3_weights_tf_dim_ordering_tf_kernels.h5"

# get a list of the cropped images
imgs_cropped <- list.files("images-cropped", full.names = TRUE)

# loop over all cropped images and get a list of the ten highest
# probability objects recognised in each image
output_list <- vector("list", length = length(imgs_cropped))

for(i in seq_along(imgs_cropped)){

  f <- paste0(getwd(), "/", imgs_cropped[i])

  # load the image at the input size that Xception expects
  img_path <- f
  img <- image_load(img_path, target_size = c(299, 299))
  x <- image_to_array(img)

  # ensure we have a 4d tensor with a single element in the batch dimension,
  # then preprocess the input for prediction using Xception
  x <- array_reshape(x, c(1, dim(x)))
  x <- xception_preprocess_input(x)

  # make predictions then decode and store them
  preds <- model %>% predict(x)
  output_list[[i]] <- imagenet_decode_predictions(preds, top = 10)[[1]]
}
# tidy output into data frame
wide_out <-
  bind_rows(output_list, .id = "image") %>%
  select(-class_name) %>%
  pivot_wider(id_cols = image,
              names_from = "class_description",
              values_from = "score",
              values_fill = 0)
```
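Before clustering, it is worth a quick check that `wide_out` has the expected shape: one row per image and one numeric column per recognised class, plus the image id:
```{r}
# quick look at the dimensions and the first few column names of the wide table
dim(wide_out)
head(names(wide_out))
```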
Now that we have attributes for each image, in the form of the objects recognised in them, we can see how images with similar objects cluster together. We use t-SNE to reduce the dimensionality, and then plot the images to see what groupings appear.
```{r}
library(Rtsne)
set.seed(42)

# run t-SNE on the numeric class-probability columns only
# (the image id column is not numeric, so it is dropped first)
tsne_out <-
  Rtsne(as.matrix(select(wide_out, -image)),
        pca = TRUE,
        perplexity = 30,
        theta = 0.0)

tsne_out_plot <-
  tibble(x = tsne_out$Y[, 1],
         y = tsne_out$Y[, 2],
         img = paste0(getwd(), "/", imgs_cropped))

library(ggimage)
ggplot(tsne_out_plot) +
  aes(x, y) +
  geom_image(aes(image = img),
             size = .025) +
  theme_void()
```
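If a file copy of the plot is wanted, the last plot can be saved with ggsave (the filename here is just an example):
```{r, eval = F}
# save the t-SNE image plot to disk; filename and size are arbitrary
ggsave("viralarchive-tsne.png", width = 10, height = 10, dpi = 300)
```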