@benmarwick
Created June 1, 2021 16:51
Object recognition in Images in #viralarchive tweets
I used the Python library GetOldTweets3 to get the tweets because the rtweet package cannot get tweets older than 6-9 days. Details about this Python library are here: https://github.com/Mottl/GetOldTweets3
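For comparison, a standard rtweet call would look roughly like this sketch (not run here, since rtweet's standard search only reaches back about a week):
```{r, eval = F}
# not run: rtweet's search_tweets() only returns tweets from roughly the last
# 6-9 days, which is why GetOldTweets3 was used instead
library(rtweet)
recent_va <- search_tweets("#viralarchive", n = 10000, include_rts = FALSE)
```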
I used this line in the shell to get tweets using the #viralarchive hashtag:
```{bash, engine.opts="-l", eval = F}
GetOldTweets3 --querysearch 'viralarchive' --maxtweets 10000
```
The Python function saves the tweets into a CSV file. We can read it into R like this:
```{r}
library(tidyverse)
library(lubridate)
va <- readr::read_csv("output_got.csv") %>%
  filter(ymd_hms(date) > "2019-01-01")
```
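As a quick sanity check, we can look at the two columns the rest of the workflow relies on, `date` and `permalink` (column names as written by GetOldTweets3):
```{r}
# peek at the columns used later for filtering and screenshotting
va %>%
  select(date, permalink) %>%
  head()
```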
Here is a time series plot of tweets per day:
```{r}
va %>%
  mutate(day = round_date(ymd_hms(date), "day")) %>%
  group_by(day) %>%
  summarise(n_tweets = n()) %>%
  drop_na() %>%
  ggplot() +
  aes(day, n_tweets) +
  geom_line()
```
Here we try to get the photos from each tweet that has them. This is not easy because the Python tool doesn't give us a specific column of media attached to the tweet, and Twitter makes it very difficult to scrape content automatically.
Here we pretend to be an iPhone, because the tweet layout is very simple on that device, and then take a screenshot of each tweet's permalink page:
```{r}
library(rvest)
library(httr)
# the user-agent string is passed to webshot() below via its useragent argument;
# calling httr::user_agent() on its own would have no effect here
ua <- 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_2 like Mac OS X; en-us)'

url <- va$permalink

library(webshot)
webshot(url,
        file = paste0("images/", basename(url), ".jpg"),
        delay = 0.1,
        # cliprect is c(top, left, width, height)
        # cliprect = c(150, 200, 600, 600),
        useragent = ua)
```
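Note that webshot renders pages with PhantomJS; if it is not already installed, a one-time setup step like this is needed first:
```{r, eval = F}
# one-time setup: webshot uses PhantomJS to render the pages it screenshots
webshot::install_phantomjs()
```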
Try to crop the screenshots to focus on the image. We look for the location of the #viralarchive text near the top, and the 'Twitter' text at the bottom. It's not very good, but it's better than nothing, maybe.
```{r}
library(tesseract)
eng <- tesseract("eng")
imgs <- list.files("images", full.names = TRUE)
for(img in imgs){
  tryCatch({

    # OCR the screenshot so we can find landmark words to crop between
    text_ocr <- tesseract::ocr_data(img, engine = eng)

    # bounding box of the word 'Twitter' near the bottom of the screenshot
    bbox_tw <-
      text_ocr %>%
      mutate(word = tolower(str_replace_all(word, "[^a-zA-Z]+", ""))) %>%
      filter(word == "twitter") %>%
      pull(bbox) %>%
      str_split(",") %>%
      unlist() %>%
      as.numeric()

    # bounding box of the word 'viralarchive' near the top of the screenshot
    bbox_va <-
      text_ocr %>%
      mutate(word = tolower(str_replace_all(word, "[^a-zA-Z]+", ""))) %>%
      filter(word == "viralarchive") %>%
      pull(bbox) %>%
      str_split(",") %>%
      unlist() %>%
      as.numeric()

    library(magick)
    imgm <- magick::image_read(img)
    img_w <- image_info(imgm)$width
    img_h <- image_info(imgm)$height

    img_h_to_tw <- bbox_tw[4] - img_h * 0.12
    img_h_to_va <- bbox_va[4] + 25

    # geometry is widthxheight+x+y: the size of the image that remains after
    # cropping, and x and y in the offset (if present) give the location of the
    # top left corner of the cropped image with respect to the original image
    img_c <-
      image_crop(imgm,
                 paste0(img_w * 0.5, "x",
                        img_h_to_tw,
                        "+",
                        img_w * 0.25,
                        "+",
                        img_h_to_va))

    image_write(img_c,
                path = paste0("images-cropped/", basename(img)),
                format = NULL,
                quality = NULL,
                depth = NULL,
                density = NULL,
                comment = NULL,
                flatten = FALSE)

  }, error = function(e){cat("ERROR :", conditionMessage(e), "\n")})
}
```
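As an optional spot check (not part of the original workflow), we can confirm that at least one cropped file was written and look at its dimensions:
```{r}
# confirm some crops were written and inspect the first one
cropped <- list.files("images-cropped", full.names = TRUE)
length(cropped)
if (length(cropped) > 0) print(magick::image_info(magick::image_read(cropped[1])))
```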
Do object recognition on the cropped photos:
```{r}
library(keras)
model <- application_xception(weights = "imagenet")
# I got an error the first time, then downloaded the model file and
# put this in ~/.keras/models : "inception_v3_weights_tf_dim_ordering_tf_kernels.h5"

# get a list of the cropped images
imgs_cropped <- list.files("images-cropped", full.names = TRUE)

# loop over all cropped images and get a list of the ten highest
# probability objects recognised in each image
output_list <- vector("list", length = length(imgs_cropped))

for(i in seq_along(imgs_cropped)){

  f <- paste0(getwd(), "/", imgs_cropped[i])

  # load the image at the input size that Xception expects
  img_path <- f
  img <- image_load(img_path, target_size = c(299, 299))
  x <- image_to_array(img)

  # ensure we have a 4d tensor with a single element in the batch dimension,
  # then preprocess the input for prediction using Xception
  x <- array_reshape(x, c(1, dim(x)))
  x <- xception_preprocess_input(x)

  # make predictions then decode and store them
  preds <- model %>% predict(x)
  output_list[[i]] <- imagenet_decode_predictions(preds, top = 10)[[1]]
}
# tidy output into data frame
wide_out <-
  bind_rows(output_list, .id = "image") %>%
  select(-class_name) %>%
  pivot_wider(id_cols = image,
              names_from = "class_description",
              values_from = "score",
              values_fill = 0)
```
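Before clustering, it is worth a quick check that `wide_out` has the expected shape: one row per image and one numeric column per recognised class, plus the image id:
```{r}
# quick look at the dimensions and the first few column names of the wide table
dim(wide_out)
head(names(wide_out))
```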
Now that we have attributes for each image, in the form of the objects recognised in them, we can see how images with similar objects cluster together. We use t-SNE to reduce the dimensionality, and then plot the images to see what groupings appear.
```{r}
library(Rtsne)
set.seed(42)

# run t-SNE on the numeric class-probability columns only
# (the image id column is not numeric, so it is dropped first)
tsne_out <-
  Rtsne(as.matrix(select(wide_out, -image)),
        pca = TRUE,
        perplexity = 30,
        theta = 0.0)

tsne_out_plot <-
  tibble(x = tsne_out$Y[, 1],
         y = tsne_out$Y[, 2],
         img = paste0(getwd(), "/", imgs_cropped))

library(ggimage)
ggplot(tsne_out_plot) +
  aes(x, y) +
  geom_image(aes(image = img),
             size = .025) +
  theme_void()
```
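If a file copy of the plot is wanted, the last plot can be saved with ggsave (the filename here is just an example):
```{r, eval = F}
# save the t-SNE image plot to disk; filename and size are arbitrary
ggsave("viralarchive-tsne.png", width = 10, height = 10, dpi = 300)
```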