@portableant
Last active May 28, 2022 08:55
Scraping data from ESRI Arcview maps/ extract images from Web resources

NOTE - The R scripts are not optimised; they were created only as a proof of concept.

Problem to solve: extract all the records from a web map of aerial photography on Historic England's website and create a CSV file. This relates to a section 21 refusal of an FOIA request by Andy Mabbett: https://www.whatdotheyknow.com/request/847052/response/2037681/attach/html/4/Mr%20Mabbett%20FOI.docx.html

He has uploaded an example image to Wikimedia Commons here: https://commons.wikimedia.org/wiki/File:Historic_England_Aerial_Photo_Explorer_-_raf_540_78_sffo_0003_-_screenshot_-_01.png

Solution

All scripts have been written and tested on an Intel-based Mac (macOS), with the latest R, RStudio, Docker, and ImageMagick. Guides to installing these packages can be found elsewhere; most I installed via Homebrew.

Oblique photos

  1. Look at this StackOverflow Q&A: https://stackoverflow.com/questions/50161492/how-do-i-scrape-data-from-an-arcgis-online-map
  2. Get the ID for the application: 9adb70fef4fa4844ba0e091a12e66455
  3. Find the URL for the layer you want to query: https://services-eu1.arcgis.com/ZOdPfBS3aqqDYPUQ/arcgis/rest/services/Obliques_03_02_2022_WGS84_Date_view/FeatureServer/29
  4. Work out what you want to query for using the HTML page https://services-eu1.arcgis.com/ZOdPfBS3aqqDYPUQ/arcgis/rest/services/Obliques_03_02_2022_WGS84_Date_view/FeatureServer/29/query?where=0%3D0&outFields=%2A&f=html
  5. Write a script to page through the queried data set and create a CSV file with only the columns you need and a newly generated URL column.

Final script: See scrapePhotoDataEsri.R

Constraints: the default query limit is 50 records per request, and the layer holds 394,390 records in total.
I didn't want to search only for RAF photos, as the other records may hold useful information.
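The paging arithmetic behind step 5 can be sketched as below. This is a minimal illustration, not the final script: `pageUrl` is a hypothetical helper introduced here, and only the base query URL is taken from the source.

```r
## Sketch of the paging used to work around the 50-record query limit:
## 394,390 records at 50 per request means 7,888 requests, each shifted
## by a resultOffset that is a multiple of 50.
total <- 394390
recordsToReturn <- 50
pages <- ceiling(total / recordsToReturn)

baseUrl <- "https://services-eu1.arcgis.com/ZOdPfBS3aqqDYPUQ/arcgis/rest/services/Obliques_03_02_2022_WGS84_Date_view/FeatureServer/29/query?where=0%3D0&outFields=%2A&f=json&resultRecordCount=50"

## Build the URL for the nth page (0-based) by appending a resultOffset
pageUrl <- function(n) paste0(baseUrl, "&resultOffset=", n * recordsToReturn)

print(pages)      # 7888
print(pageUrl(1)) # first page after the initial request, offset 50
```

Each generated URL can then be fetched with jsonlite::fromJSON, as the full scripts below do.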

To parse just RAF photography:

https://services-eu1.arcgis.com/ZOdPfBS3aqqDYPUQ/arcgis/rest/services/Obliques_03_02_2022_WGS84_Date_view/FeatureServer/29/query?where=PHOTOGRAPHER%3D%27RAF%27&outFields=%2A&f=html

The final f parameter accepts html, pjson, or pgeojson.
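The percent-encoded where clause in the URL above can be produced in base R rather than written by hand. A small sketch; the clause itself is the only input:

```r
## Percent-encode the where clause for RAF photography;
## reserved = TRUE encodes the = and ' characters
where <- URLencode("PHOTOGRAPHER='RAF'", reserved = TRUE)
print(where)  # "PHOTOGRAPHER%3D%27RAF%27"
```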

Vertical photos

  1. Look at this StackOverflow Q&A: https://stackoverflow.com/questions/50161492/how-do-i-scrape-data-from-an-arcgis-online-map
  2. Get the ID for the application: 9adb70fef4fa4844ba0e091a12e66455
  3. Find the URL for the layer you want to query: https://services-eu1.arcgis.com/ZOdPfBS3aqqDYPUQ/arcgis/rest/services/Verts_07_02_22_WGS84_Date_view/FeatureServer/32/
  4. Work out what you want to query for using the HTML page https://services-eu1.arcgis.com/ZOdPfBS3aqqDYPUQ/arcgis/rest/services/Verts_07_02_22_WGS84_Date_view/FeatureServer/32/query?where=0%3D0&outFields=%2A&f=html
  5. Write a script to page through the queried data set and create a CSV file with only the columns you need and a newly generated URL column.

Final script: See scrapeVertical.R

Constraints: the default query limit is 50 records per request, and the layer holds 47,230 records in total.
I didn't want to search only for RAF photos, as the other records may hold useful information.

Geospatial conversion

Each image, in theory, has a British National Grid easting/northing pair. The scripts below convert these to a latitude/longitude pair.
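The scripts below do this conversion with sp and rgdal; rgdal has since been retired, so here is a minimal sketch of the same British National Grid (EPSG:27700) to WGS84 (EPSG:4326) conversion using the sf package instead. The easting/northing pair is illustrative (roughly central London), not taken from the data set.

```r
library(sf)

## One illustrative BNG easting/northing pair
pt  <- st_sfc(st_point(c(530000, 180000)), crs = 27700)  # British National Grid
wgs <- st_transform(pt, 4326)                            # WGS84

## X is longitude, Y is latitude after the transform
print(st_coordinates(wgs))  # X ~ -0.13, Y ~ 51.50
```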

Parameters available via the Esri REST API

The final f parameter accepts html, pjson, or pgeojson; use one of the JSON formats for machine-readable output.

Visualise data

You can visualise these data using Simon Willison's Datasette libraries via: https://aerial-photos-exploration.glitch.me/data/ The mapping plugin is active in this instance, and it may take a few minutes to wake up.

Remote downloading

Using RSelenium and a Dockerised Selenium server, it is possible to automate the download of these images. My Docker knowledge is fairly basic, but this is how I got it working.

  1. Install Docker
  2. Pull a standalone Selenium Docker image (I'm using this one: https://github.com/SeleniumHQ/docker-selenium)
     docker pull selenium/standalone-chrome
  3. Map a directory on your machine for storing downloaded images - I mapped a directory in my home folder.
  4. Start Docker with Selenium standalone Chrome using a command like the one below:
     docker run -d -p 4444:4444 -v /Users/Danielpett/rafImages:/home/seluser/Downloads --shm-size="2g" selenium/standalone-chrome:4.1.4-20220427
  5. Use the automatedDownload.R script to get an image, based on this example code: https://gist.github.com/stuartlangridge/82c87a601a7e2ae566e640d93138ed85
  6. Downloaded images will appear in the mapped volume.

Batch downloading

Another script, batchAutomated.R, allows one to download all the images (this may take a long time). It downloads the images and converts them to JPGs using ImageMagick (via the magick package). Timeout increases were needed to capture decent-sized images and to pause processing whilst images are saved.

automatedDownload.R

library(RSelenium)

## Connect to the Selenium server running in the Docker container
remDr <- remoteDriver(
  remoteServerAddr = "127.0.0.1",
  browserName = "chrome",
  port = 4444L
)
remDr$open()
remDr$navigate("https://historicengland.org.uk/images-books/archive/collections/aerial-photos/record/HEA_S3288_V_0007")
remDr$getTitle()
filename <- 'HEA_S3288_V_0007'
## Widen the page so the canvas renders at full size, then read the
## canvas as a data URL and trigger a download via a synthetic link
js <- paste0("var ap = document.querySelector('div.articlePage.container');
ap.style.width='6000px';
ap.style.maxWidth='6000px';
setTimeout(function() {
  var c = document.querySelector('canvas.stage');
  var u = document.createElement('canvas').toDataURL.call(c);
  var a = document.createElement('a');
  a.href = u;
  a.download = '", filename, "';
  a.click();
}, 6000);")
print(js)
remDr$executeScript(js, args = list())
remDr$close()
batchAutomated.R

library(RSelenium)
library(magick)

records <- read.table(file = "raf_verticals.csv", sep = ",", header = TRUE)
meta <- records[, c("Filename", "url")]
print(nrow(meta))

remDr <- remoteDriver(
  remoteServerAddr = "127.0.0.1",
  browserName = "chrome",
  port = 4444L
)
## Open the session
remDr$open()
## Loop through the rows
for (i in 1:nrow(meta)) {
  urlPath <- paste0(meta[i, "url"])
  filename <- paste0(meta[i, "Filename"], '.png')
  path <- paste0('/Users/Danielpett/rafImages/', filename)
  jpgPath <- paste0('/Users/Danielpett/rafImages/', meta[i, "Filename"], '.jpg')
  ## Skip records already downloaded or already converted
  if (!file.exists(path) && !file.exists(jpgPath)) {
    remDr$navigate(urlPath)
    ## Widen the page, then read the canvas as a data URL and trigger a download
    js <- paste0("var ap = document.querySelector('div.articlePage.container');
ap.style.width='6000px';
ap.style.maxWidth='6000px';
setTimeout(function() {
  var c = document.querySelector('canvas.stage');
  var u = document.createElement('canvas').toDataURL.call(c);
  var a = document.createElement('a');
  a.href = u;
  a.download = '", filename, "';
  a.click();
}, 20000);")
    ## executeScript needs a non-empty args list; the value is unused
    remDr$executeScript(js, args = list('fugazi'))
    ## Pause so the download can finish before converting
    Sys.sleep(30)
    if (file.exists(path)) {
      Sys.sleep(10)
      png <- image_read(path)
      print(image_info(png))
      image_write(png, path = jpgPath, format = "jpeg", quality = 100)
      unlink(path)
    }
  }
}
## Close the session
remDr$close()
scrapePhotoDataEsri.R

library(jsonlite)
library(sp)
library(rgdal)

total <- 394390
recordsToReturn <- 50
pagination <- ceiling(total / recordsToReturn)
url <- "https://services-eu1.arcgis.com/ZOdPfBS3aqqDYPUQ/arcgis/rest/services/Obliques_03_02_2022_WGS84_Date_view/FeatureServer/29/query?where=0%3D0&outFields=%2A&f=json&resultRecordCount=50"
json <- fromJSON(url)
imageUrl <- 'https://historicengland.org.uk/images-books/archive/collections/aerial-photos/record/'
data <- json$features$attributes
keeps <- c("OBJECTID", "PLACE_NAME", "COMMENTS", "BORN_DIGITAL", "DATE_LAST_MODIFIED", "COPYRIGHT_CODE", "FLIGHT_NUMBER", "DATE_FLOWN", "YR", "YEAR", "MONTH", "PHOTOGRAPHER", "COPYRIGHT", "EASTING", "NORTHING", "IMAGE_MEDIA_VALUE", "IMAGE_FORMAT_VALUE", "FILENAME", "REPOSITORY_CODE", "PILOT")
data <- data[, (names(data) %in% keeps)]
## Page through the remaining records (the first page was fetched above)
for (i in seq(from = recordsToReturn, to = (pagination * recordsToReturn), by = recordsToReturn)) {
  urlDownload <- paste(url, '&resultOffset=', i, sep = '')
  print(urlDownload)
  pagedJson <- fromJSON(urlDownload)
  records <- pagedJson$features$attributes
  records <- records[, (names(records) %in% keeps)]
  data <- rbind(data, records)
  Sys.sleep(1.0)
}
data$url <- paste0(imageUrl, data$FILENAME)
df = data[order(data[, 'OBJECTID']), ]
df = df[!duplicated(df$OBJECTID), ]
## Convert British National Grid eastings/northings to WGS84
ukgrid = "+init=epsg:27700"
latlong = "+init=epsg:4326"
pointData <- subset(df, select = c("EASTING", "NORTHING"))
coords <- cbind(EASTING = as.numeric(as.character(pointData$EASTING)), NORTHING = as.numeric(as.character(pointData$NORTHING)))
east_north <- SpatialPoints(coords, proj4string = CRS(ukgrid))
latlon <- spTransform(east_north, CRS(latlong))
## After the transform the EASTING column holds longitude and NORTHING latitude
df$lon <- latlon$EASTING
df$lat <- latlon$NORTHING
write.csv(df, file = 'aerial.csv', row.names = FALSE, na = "")
raf <- subset(df, PHOTOGRAPHER == "RAF")
write.csv(raf, file = 'raf.csv', row.names = FALSE, na = "")
scrapeVertical.R

library(jsonlite)
library(sp)
library(rgdal)

total <- 47230
recordsToReturn <- 50
pagination <- ceiling(total / recordsToReturn)
url <- "https://services-eu1.arcgis.com/ZOdPfBS3aqqDYPUQ/arcgis/rest/services/Verts_07_02_22_WGS84_Date_view/FeatureServer/32/query?where=0%3D0&outFields=%2A&f=json&resultRecordCount=50"
json <- fromJSON(url)
imageUrl <- 'https://historicengland.org.uk/images-books/archive/collections/aerial-photos/record/'
data <- json$features$attributes
keeps <- c("OBJECTID", "PLACE_NAME", "COMMENTS", "BORN_DIGITAL", "DATE_LAST_MODIFIED", "COPYRIGHT_CODE", "FLIGHT_NUMBER", "DATE_FLOWN", "YR", "YEAR", "MONTH", "PHOTOGRAPHER", "COPYRIGHT", "EASTING", "NORTHING", "IMAGE_MEDIA_VALUE", "IMAGE_FORMAT_VALUE", "Filename", "REPOSITORY_CODE", "PILOT", "FLYING_HEIGHT", "SCALE", "FRAME_NUMBER")
data <- data[, (names(data) %in% keeps)]
## Page through the remaining records (the first page was fetched above)
for (i in seq(from = recordsToReturn, to = (pagination * recordsToReturn), by = recordsToReturn)) {
  urlDownload <- paste(url, '&resultOffset=', i, sep = '')
  print(urlDownload)
  pagedJson <- fromJSON(urlDownload)
  records <- pagedJson$features$attributes
  records <- records[, (names(records) %in% keeps)]
  data <- rbind(data, records)
  Sys.sleep(1.0)
}
data$url <- paste0(imageUrl, data$Filename)
## Order and deduplicate before the coordinate conversion, so df exists
df = data[order(data[, 'OBJECTID']), ]
df = df[!duplicated(df$OBJECTID), ]
## Convert British National Grid eastings/northings to WGS84
ukgrid = "+init=epsg:27700"
latlong = "+init=epsg:4326"
pointData <- subset(df, select = c("EASTING", "NORTHING"))
coords <- cbind(EASTING = as.numeric(as.character(pointData$EASTING)), NORTHING = as.numeric(as.character(pointData$NORTHING)))
east_north <- SpatialPoints(coords, proj4string = CRS(ukgrid))
latlon <- spTransform(east_north, CRS(latlong))
## After the transform the EASTING column holds longitude and NORTHING latitude
df$lon <- latlon$EASTING
df$lat <- latlon$NORTHING
write.csv(df, file = 'aerialVerts.csv', row.names = FALSE, na = "")
raf <- subset(df, PHOTOGRAPHER == "RAF")
write.csv(raf, file = 'raf_verticals.csv', row.names = FALSE, na = "")