@kbroman
Last active August 29, 2015 14:06
> test()
Loading required package: testthat
Testing aRxiv
Loading aRxiv
arxiv_errors : ...
arxiv_search in batches : ..
cleaning the records : ...................
search range of dates : ....
basic searches : .........
sort_by and sort_order args work : ...1
is_too_many : ...
1. Failure(@test-sort.R#42): sort by lastUpdatedDate --------------------------------------------------------------
zr$updated not equal to expected
Lengths (2, 0) differ (string compare on first 0)
> test()
Testing aRxiv
Loading aRxiv
arxiv_errors : ...
arxiv_search in batches : ..
cleaning the records : ...................
search range of dates : ....
basic searches : .........
sort_by and sort_order args work : 1234
is_too_many : ...
1. Failure(@test-sort.R#15): sort by publishedDate ----------------------------------------------------------------
z$submitted not equal to expected
Lengths (2, 1) differ (string compare on first 1)
2. Failure(@test-sort.R#20): sort by publishedDate ----------------------------------------------------------------
zr$submitted not equal to expected
Lengths (2, 1) differ (string compare on first 1)
3. Failure(@test-sort.R#37): sort by lastUpdatedDate --------------------------------------------------------------
z$updated not equal to expected
Lengths (2, 0) differ (string compare on first 0)
4. Failure(@test-sort.R#42): sort by lastUpdatedDate --------------------------------------------------------------
zr$updated not equal to expected
Lengths (2, 0) differ (string compare on first 0)
> test()
Testing aRxiv
Loading aRxiv
arxiv_errors : ...
arxiv_search in batches : ..
cleaning the records : ...................
search range of dates : ....
basic searches : .........
sort_by and sort_order args work : ....
is_too_many : ...
# getting variable responses from arXiv API when requesting sorted results
# the following makes the same request many times in a row (n.tries times)
if(!require(httr)) install.packages("httr")
if(!require(devtools)) install.packages("devtools")
library(devtools)
if(!require(aRxiv)) install_github("ropensci/aRxiv")
library(httr)
library(aRxiv)
# problem query
# http://export.arxiv.org/api/query?search_query=ti:deconvolution+AND+submittedDate:[199001010000+TO+201409062400]&max_results=2
# send the same query n.tries times; keep both the raw responses and the parsed data frames
repeated_search <-
    function(query, sort_by=c("submittedDate", "lastUpdatedDate", "relevance"),
             ascending=TRUE, n.tries=50, delay=1, limit=2, start=0, verbose=FALSE)
{
    query_url <- "http://export.arxiv.org/api/query"

    # delay (in seconds) between requests, as used by aRxiv:::delay_if_necessary()
    options(aRxiv_delay=delay)

    sort_by <- match.arg(sort_by)
    sort_order <- if(ascending) "ascending" else "descending"

    raw_result <- tab_result <- vector("list", n.tries)

    for(s in seq_len(n.tries)) {
        if(verbose) message("try ", s)

        aRxiv:::delay_if_necessary()

        # raw HTTP response
        raw_result[[s]] <- POST(query_url, body=list(search_query=query,
                                                     max_results=limit, start=start,
                                                     sortBy=sort_by, sortOrder=sort_order))

        # parsed response, as a data frame
        tab_result[[s]] <- aRxiv:::listresult2df( aRxiv:::get_entries( aRxiv:::result2list(raw_result[[s]])) )
    }

    list(raw_result=raw_result, tab_result=tab_result)
}
time_query <- "ti:deconvolution AND submittedDate:[199001010000 TO 201409062400]"
other_query <- "ti:deconvolution"
results_timequery <- repeated_search(time_query, n.tries=100, verbose=TRUE)
results_otherquery <- repeated_search(other_query, n.tries=100, verbose=TRUE)
# save() writes R's own binary format, so use an .RData extension
save(results_timequery, results_otherquery, file="results.RData")
# same number of rows for each?
sapply(results_timequery$tab_result, nrow)
sapply(results_otherquery$tab_result, nrow)
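
The two sapply() calls above print one row count per try, which is hard to scan across 100 tries. A minimal sketch of one way to condense them into a frequency table follows. The helper name count_rows is hypothetical (it is not part of aRxiv), the NULL handling is an assumption about what an empty response might look like after parsing, and the toy list is made-up input, not real API output.

```r
# Hypothetical helper (not part of aRxiv): tabulate how many rows
# each repeated request returned.
count_rows <- function(tab_result) {
    # assumption: an empty response may come back as NULL; count it as 0 rows
    n <- vapply(tab_result,
                function(df) if (is.null(df)) 0L else nrow(df),
                integer(1))
    table(n_rows = n)
}

# toy input standing in for results_timequery$tab_result (made-up data)
toy <- list(data.frame(id = 1:2), data.frame(id = 1L),
            data.frame(id = 1:2), NULL)

counts <- count_rows(toy)
print(counts)              # frequency of each row count
print(prop.table(counts))  # the same, as proportions of all tries
```

With the real results, count_rows(results_timequery$tab_result) and count_rows(results_otherquery$tab_result) could be compared side by side.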

kbroman commented Sep 9, 2014

While working on the aRxiv package, which provides access to the arXiv API, I'm finding that I get variable results when I use submittedDate ranges in the query.

I initially thought the problem was related to sortBy and sortOrder, but the issue seems to be the use of submittedDate in the query itself.

The time_query here returns two entries most of the time, but a single entry 15% of the time. A separate gist shows the actual XML results for one case in which one entry was returned and another in which two entries were returned.
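
To see which entry goes missing in the short responses, one option is to compare the ids in each try against the union of ids seen across all tries. The sketch below is only an illustration: missing_ids is a hypothetical helper, the column name "id" is an assumption about the parsed data frame, and the toy list is made-up input rather than real API output.

```r
# Hypothetical helper (not part of aRxiv): for each try, list the ids
# seen in at least one try but absent from this one.
missing_ids <- function(tab_result, id_col = "id") {
    ids <- lapply(tab_result, function(df) {
        # assumption: an empty response may come back as NULL
        if (is.null(df)) character(0) else as.character(df[[id_col]])
    })
    all_ids <- unique(unlist(ids))
    lapply(ids, function(x) setdiff(all_ids, x))
}

# toy input standing in for results_timequery$tab_result (made-up data)
toy_results <- list(data.frame(id = c("a", "b")),
                    data.frame(id = "a"),
                    data.frame(id = c("a", "b")))

miss <- missing_ids(toy_results)
short <- which(lengths(miss) > 0)   # tries that returned fewer entries
print(short)
print(miss[short])                  # which ids those tries lacked
print(mean(lengths(miss) > 0))      # fraction of tries that came back short
```

If the same id is always the one that is missing, that would suggest the dropped entry is systematic rather than random.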
