@sckott
Last active August 29, 2015 14:13

The elastic package is an R client for Elasticsearch. Someone suggested today that we incorporate scrolling search, which is supposed to be much faster than paging via the from and size parameters (issue here).

So here it is. You have to do the paging yourself with the scroll id, which seems in keeping with the ES Python client, but should it be done automagically in the scroll() function?

The example below uses an Elasticsearch index of 20,000 documents, grabbed from the PLOS search API.

If you want to replicate this with the same data, the zip file is at http://cl.ly/ZK7W/download/plos_big_data.json.zip - and you can use the Elasticsearch bulk API to load the data into your ES engine. Let me know if you want help doing that...

Installation, loading

install.packages("devtools")
devtools::install_github("ropensci/elastic")
library("elastic")

Scrolling

Get a scroll_id

res <- Search(index = 'plosbigdata', scroll="1m")
res$`_scroll_id`
#> [1] "cXVlcnlUaGVuRmV0Y2g7NTs1MTExOjB4bkVpY3d1U3NLeGtQcHR3LWJma2c7NTExMjoweG5FaWN3dVNzS3hrUHB0dy1iZmtnOzUxMTM6MHhuRWljd3VTc0t4a1BwdHctYmZrZzs1MTE0OjB4bkVpY3d1U3NLeGtQcHR3LWJma2c7NTExNToweG5FaWN3dVNzS3hrUHB0dy1iZmtnOzA7"

Setting search_type = "scan" turns off sorting of results, which makes scrolling faster

res <- Search(index = 'plosbigdata', scroll="1m", search_type = "scan")
res$`_scroll_id`
#> [1] "c2Nhbjs1OzUxMTY6MHhuRWljd3VTc0t4a1BwdHctYmZrZzs1MTE3OjB4bkVpY3d1U3NLeGtQcHR3LWJma2c7NTExODoweG5FaWN3dVNzS3hrUHB0dy1iZmtnOzUxMTk6MHhuRWljd3VTc0t4a1BwdHctYmZrZzs1MTIwOjB4bkVpY3d1U3NLeGtQcHR3LWJma2c7MTt0b3RhbF9oaXRzOjIwMDAwOw=="

Pass scroll_id to scroll function

scroll(scroll_id = res$`_scroll_id`)
#> $`_scroll_id`
#> [1] "c2Nhbjs1OzUxMTY6MHhuRWljd3VTc0t4a1BwdHctYmZrZzs1MTE3OjB4bkVpY3d1U3NLeGtQcHR3LWJma2c7NTExODoweG5FaWN3dVNzS3hrUHB0dy1iZmtnOzUxMTk6MHhuRWljd3VTc0t4a1BwdHctYmZrZzs1MTIwOjB4bkVpY3d1U3NLeGtQcHR3LWJma2c7MTt0b3RhbF9oaXRzOjIwMDAwOw=="
#> 
#> $took
#> [1] 1
#> 
#> $timed_out
#> [1] FALSE
#> 
#> $`_shards`
#> $`_shards`$total
#> [1] 5
#> 
#> $`_shards`$successful
#> [1] 5
#> 
#> $`_shards`$failed
#> [1] 0
#> 
#> 
#> $hits
#> $hits$total
#> [1] 20000
#> 
#> $hits$max_score
#> [1] 0
#> 
#> $hits$hits
#> $hits$hits[[1]]
#> $hits$hits[[1]]$`_index`
#> [1] "plosbigdata"
#> 
#> $hits$hits[[1]]$`_type`
#> [1] "article"
#> 
#> $hits$hits[[1]]$`_id`
#> [1] "4"
#> 
#> $hits$hits[[1]]$`_version`
#> [1] 1
#> 
#> $hits$hits[[1]]$`_score`
#> [1] 0
#> 
#> $hits$hits[[1]]$`_source`
#> $hits$hits[[1]]$`_source`$id
#> [1] "10.1371/journal.pone.0116236"
#> 
#> $hits$hits[[1]]$`_source`$journal
#> [1] "PLOS ONE"
#> 
#> $hits$hits[[1]]$`_source`$author
#> [1] "Hsiao-Sang Chu,Shan-Chwen Chang,Elizabeth P Shen,Fung-Rong Hu"
#> 
#> $hits$hits[[1]]$`_source`$abstract
#> [1] "Purpose: To analyze the clinical characteristics of nontuberculous mycobacterial (NTM) ocular infections and the species-specific in vitro antimicrobial susceptibility. Material and Methods: In 2000 to 2011 at the National Taiwan University Hospital, multilocus sequencing of rpoB, hsp65 and secA was used to identify NTM isolates from ocular infections. The clinical presentation and treatment outcomes were retrospectively compared between species. Broth microdilution method was used to determine the minimum inhibitory concentrations of amikacin (AMK), clarithromycin (CLA), ciprofloxacin (CPF), levofloxacin (LVF), moxifloxacin (MXF) and gatifloxacin (GAF) against all strains. The activities of antimicrobial combinations were assessed by the checkerboard titration method. Results: A total of 24 NTM strains (13 Mycobacterium abscessus and 11 Mycobacterium massiliense) were isolated from 13 keratitis, 10 buckle infections, and 1 canaliculitis cases. Clinically, manifestations and outcomes caused by these two species were similar and surgical intervention was necessary for medically unresponsive NTM infection. Microbiologically, 100% of M. abscessus and 90.9% of M. massiliense ocular isolates were susceptible to amikacin but all were resistant to fluoroquinolones. Inducible clarithromycin resistance existed in 69.3% of M. abscessus but not in M. massiliense isolates. None of the AMK-CLA, AMK-MXF, AMK-GAF, CLA-MXF and CLA-GAF combinations showed synergistic or antagonistic effect against both species in vitro. Conclusions: M. abscessus and M. massiliense are the most commonly identified species for NTM ocular infections in Taiwan. Both species were resistant to fluoroquinolones, susceptible to amikacin, and differ in clarithromycin resistance. Combined antimicrobial treatments showed no interaction in vitro but could be considered in combination with surgical interventions for eradication of this devastating ocular infection. "
#> 
#> $hits$hits[[1]]$`_source`$title
#> [1] "Nontuberculous Mycobacterial Ocular Infections—Comparing the Clinical and Microbiological Characteristics between Mycobacterium abscessus and Mycobacterium massiliense"
#> 
#> 
#> 
#> $hits$hits[[2]]
#> $hits$hits[[2]]$`_index`
#> [1] "plosbigdata"
#> 
#> $hits$hits[[2]]$`_type`
#> [1] "article"
#> 
#> $hits$hits[[2]]$`_id`
#> [1] "9"
#> 
#> $hits$hits[[2]]$`_version`
#> [1] 1
#> 
#> $hits$hits[[2]]$`_score`
#> [1] 0
#> 
#> $hits$hits[[2]]$`_source`
#> $hits$hits[[2]]$`_source`$id
#> [1] "10.1371/journal.pone.0117137"
#> 
#> $hits$hits[[2]]$`_source`$journal
#> [1] "PLOS ONE"
#> 
#> $hits$hits[[2]]$`_source`$author
#> [1] "Brian C Gunia,J Keith Murnighan"
#> 
#> $hits$hits[[2]]$`_source`$abstract
#> [1] "\nEven the simplest choices can prompt decision-makers to balance their preferences against other, more pragmatic considerations like price. Thus, discerning people’s preferences from their decisions creates theoretical, empirical, and practical challenges. The current paper addresses these challenges by highlighting some specific circumstances in which the amount of time that people spend examining potential purchase items (i.e., viewing time) can in fact reveal their preferences. Our model builds from the gazing literature, in a purchasing context, to propose that the informational value of viewing time depends on prices. Consistent with the model’s predictions, four studies show that when prices are absent or moderate, viewing time provides a signal that is consistent with a person’s preferences and purchase intentions. When prices are extreme or consistent with a person’s preferences, however, viewing time is a less reliable predictor of either. Thus, our model highlights a price-contingent “viewing bias,” shedding theoretical, empirical, and practical light on the psychology of preferences and visual attention, and identifying a readily observable signal of preference.\n"
#> 
#> $hits$hits[[2]]$`_source`$title
#> [1] "The Tell-Tale Look: Viewing Time, Preferences, and Prices"
#> 

Get all results - one approach is to use a while loop

res <- Search(index = 'plosbigdata', scroll="5m", search_type = "scan")
out <- list()
hits <- 1
while(hits != 0){
  res <- scroll(scroll_id = res$`_scroll_id`)
  hits <- length(res$hits$hits)
  if(hits > 0)
    out <- c(out, res$hits$hits)
}
lapply(out, "[[", "_source")[1:2]
#> [[1]]
#> [[1]]$id
#> [1] "10.1371/journal.pone.0116236"
#> 
#> [[1]]$journal
#> [1] "PLOS ONE"
#> 
#> [[1]]$author
#> [1] "Hsiao-Sang Chu,Shan-Chwen Chang,Elizabeth P Shen,Fung-Rong Hu"
#> 
#> [[1]]$abstract
#> [1] "Purpose: To analyze the clinical characteristics of nontuberculous mycobacterial (NTM) ocular infections and the species-specific in vitro antimicrobial susceptibility. Material and Methods: In 2000 to 2011 at the National Taiwan University Hospital, multilocus sequencing of rpoB, hsp65 and secA was used to identify NTM isolates from ocular infections. The clinical presentation and treatment outcomes were retrospectively compared between species. Broth microdilution method was used to determine the minimum inhibitory concentrations of amikacin (AMK), clarithromycin (CLA), ciprofloxacin (CPF), levofloxacin (LVF), moxifloxacin (MXF) and gatifloxacin (GAF) against all strains. The activities of antimicrobial combinations were assessed by the checkerboard titration method. Results: A total of 24 NTM strains (13 Mycobacterium abscessus and 11 Mycobacterium massiliense) were isolated from 13 keratitis, 10 buckle infections, and 1 canaliculitis cases. Clinically, manifestations and outcomes caused by these two species were similar and surgical intervention was necessary for medically unresponsive NTM infection. Microbiologically, 100% of M. abscessus and 90.9% of M. massiliense ocular isolates were susceptible to amikacin but all were resistant to fluoroquinolones. Inducible clarithromycin resistance existed in 69.3% of M. abscessus but not in M. massiliense isolates. None of the AMK-CLA, AMK-MXF, AMK-GAF, CLA-MXF and CLA-GAF combinations showed synergistic or antagonistic effect against both species in vitro. Conclusions: M. abscessus and M. massiliense are the most commonly identified species for NTM ocular infections in Taiwan. Both species were resistant to fluoroquinolones, susceptible to amikacin, and differ in clarithromycin resistance. Combined antimicrobial treatments showed no interaction in vitro but could be considered in combination with surgical interventions for eradication of this devastating ocular infection. "
#> 
#> [[1]]$title
#> [1] "Nontuberculous Mycobacterial Ocular Infections—Comparing the Clinical and Microbiological Characteristics between Mycobacterium abscessus and Mycobacterium massiliense"
#> 
#> 
#> [[2]]
#> [[2]]$id
#> [1] "10.1371/journal.pone.0117137"
#> 
#> [[2]]$journal
#> [1] "PLOS ONE"
#> 
#> [[2]]$author
#> [1] "Brian C Gunia,J Keith Murnighan"
#> 
#> [[2]]$abstract
#> [1] "\nEven the simplest choices can prompt decision-makers to balance their preferences against other, more pragmatic considerations like price. Thus, discerning people’s preferences from their decisions creates theoretical, empirical, and practical challenges. The current paper addresses these challenges by highlighting some specific circumstances in which the amount of time that people spend examining potential purchase items (i.e., viewing time) can in fact reveal their preferences. Our model builds from the gazing literature, in a purchasing context, to propose that the informational value of viewing time depends on prices. Consistent with the model’s predictions, four studies show that when prices are absent or moderate, viewing time provides a signal that is consistent with a person’s preferences and purchase intentions. When prices are extreme or consistent with a person’s preferences, however, viewing time is a less reliable predictor of either. Thus, our model highlights a price-contingent “viewing bias,” shedding theoretical, empirical, and practical light on the psychology of preferences and visual attention, and identifying a readily observable signal of preference.\n"
#> 
#> [[2]]$title
#> [1] "The Tell-Tale Look: Viewing Time, Preferences, and Prices"
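
On the question of whether the paging should happen automatically, here is a minimal sketch of what a wrapper could look like. scroll_all() is a hypothetical helper, not part of elastic. It takes the first response plus a function that fetches the next page, and collects pages in a list instead of growing out on every pass. make_fake_scroll() is only a stand-in for scroll() so the sketch runs without an Elasticsearch server.

```r
# Hypothetical helper (not part of the elastic package): keep requesting
# pages until one comes back empty, then return every hit in one list.
scroll_all <- function(first, next_page) {
  pages <- list(first$hits$hits)   # empty when search_type = "scan"
  id <- first$`_scroll_id`
  repeat {
    res <- next_page(id)
    if (length(res$hits$hits) == 0) break
    pages[[length(pages) + 1]] <- res$hits$hits
    id <- res$`_scroll_id`         # always use the most recent scroll id
  }
  # Flatten one level: list of pages -> list of hits
  do.call(c, pages)
}

# Stand-in for scroll(): serves n_pages pages of fake hits, then an empty page.
make_fake_scroll <- function(n_pages = 3, per_page = 2) {
  served <- 0
  function(id) {
    served <<- served + 1
    hits <- if (served <= n_pages) {
      lapply(seq_len(per_page), function(i)
        list(`_id` = as.character((served - 1) * per_page + i)))
    } else {
      list()
    }
    list(`_scroll_id` = id, hits = list(hits = hits))
  }
}

# A scan-style first response carries a scroll id but no hits
first <- list(`_scroll_id` = "abc", hits = list(hits = list()))
out <- scroll_all(first, make_fake_scroll())
length(out)
#> [1] 6
```

Against a live index, the same helper would presumably be called as scroll_all(Search(index = 'plosbigdata', scroll = "5m", search_type = "scan"), function(id) scroll(scroll_id = id)).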

Time comparison

With 5 shards on my local box, the default size parameter (10, applied per shard when search_type is "scan") results in 50 results returned per request, so we'll use a size of 50 for the comparison.

Using paging

res <- Search(index = 'plosbigdata', size=0)$hits$total
system.time( 
  # `from` is a zero-based offset, so the first page starts at 0
  res_paging <- lapply(seq(0, res - 1, 50), function(x) 
    Search(index = 'plosbigdata', size=50, from=x)$hits$hits)
)
#>    user  system elapsed 
#>  13.742   0.557  18.146
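
Elasticsearch's from parameter is a zero-based offset (the first document sits at from = 0), so it can help to check the page offsets before timing anything. A small sketch, assuming the 20,000 documents and page size of 50 used here. The names total, page_size and offsets are illustrative only.

```r
total <- 20000      # hits$total reported for the index
page_size <- 50

# Zero-based offsets: 0, 50, ..., 19950
offsets <- seq(0, total - 1, by = page_size)
length(offsets)
#> [1] 400
min(offsets)
#> [1] 0
max(offsets)
#> [1] 19950

# Documents covered across all pages
sum(pmin(page_size, total - offsets))
#> [1] 20000
```

Starting the sequence at 1 instead of 0 would still issue 400 requests, but it would skip the first document.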

Using scrolling

res <- Search(index = 'plosbigdata', scroll="5m", search_type = "scan")
out <- list()
hits <- 1
system.time( 
  while(hits != 0){
    res <- scroll(scroll_id = res$`_scroll_id`)
    hits <- length(res$hits$hits)
    if(hits > 0)
      out <- c(out, res$hits$hits)
  }
)
#>    user  system elapsed 
#>  12.378   0.474  13.874
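
Putting the two elapsed times side by side (a single run each on one local machine, so treat the ratio as indicative only):

```r
elapsed_paging <- 18.146   # seconds, from the paging run above
elapsed_scroll <- 13.874   # seconds, from the scrolling run above

# Fraction of the paging time that scrolling needed
round(elapsed_scroll / elapsed_paging, 2)
#> [1] 0.76
```

In this run, scrolling took roughly three quarters of the wall-clock time of paging.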