Hoover search batch API (draft)

The idea is to make it possible to search for a large number of terms without hitting the rate limiter, while keeping the results reasonably accurate.

The solution is to use the Elasticsearch _msearch endpoint with count-style searches, so each individual query returns only its hit count, along with any aggregations that were requested.
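As an illustration only (not the actual server code), the translation into an _msearch body could look roughly like the sketch below. search_type: 'count' is the older Elasticsearch spelling (newer versions use "size": 0 in the body instead), and the mapping from collection names to index names is assumed.

// Sketch: how /batch might build the Elasticsearch _msearch payload.
// buildMsearchBody and the use of collection names as index names are
// illustrative assumptions, not the real server implementation.
function buildMsearchBody({queries, aggs, collections}) {
  return queries.map(query =>
    // header line: which indices to search, count-style search
    JSON.stringify({index: collections, search_type: 'count'}) + '\n' +
    // body line: the user's query plus the shared aggregations
    JSON.stringify({query: query, aggs: aggs})
  ).join('\n') + '\n';
}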

The request

Send a GET request to /batch with no URL parameters and with the data formatted as JSON in the request body. You can see an example request in src/search.js, in the search method.

Here's an example request:

{
    "aggs": {
        "approx_distinct_hash": {
            "cardinality": {
                "field": "sha1"
            }
        }
    },
    "queries": [
        { "query_string": {"query": "one"}},
        { "query_string": {"query": "two"}},
        { "query_string": {"query": "three"}},
        { "query_string": {"query": "*"}},
        { "query_string": {"query": "!@$^!^#!@@"}}
    ],
    "collections": [
        "Code",
        "Test",
        "Enron"
    ]
}
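For reference, here is a minimal sketch of sending this request from Node; the hostname is a placeholder. Note that browsers' fetch() refuses to attach a body to a GET request, while Node's built-in http module sends it without complaint.

// Sketch: issuing the batch request with Node's built-in http module.
const http = require('http');

function batchSearch(body, callback) {
  const payload = JSON.stringify(body);
  const req = http.request({
    hostname: 'hoover.example.com',   // placeholder host
    path: '/batch',
    method: 'GET',
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(payload),
    },
  }, res => {
    let data = '';
    res.on('data', chunk => { data += chunk; });
    res.on('end', () => callback(null, JSON.parse(data)));
  });
  req.on('error', callback);
  req.write(payload);
  req.end();
}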

The query_string.query field is filled in from user input: each line the user enters becomes a separate query.

A single request may contain at most 100 queries. Requests with more than 100 queries will fail.
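A minimal client-side sketch that turns user-entered lines into query objects and chunks them into batches of at most 100 (the helper name is made up):

// Sketch: turn a block of user-entered terms into batches of <= 100 queries.
function splitIntoBatches(text, batchSize = 100) {
  const queries = text
    .split('\n')
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .map(line => ({query_string: {query: line}}));

  const batches = [];
  for (let i = 0; i < queries.length; i += batchSize) {
    batches.push(queries.slice(i, i + batchSize));
  }
  return batches;
}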

The aggs object is attached to every query that is sent, and its result is included in each query's response.

The response

The response contains a responses field with one result per query, in the same order as the queries were given.

For each response object, the following data is important:

  • response.hits.total the total number of hits for that query
  • response.timed_out true if the query timed out
  • response._query the query object you passed in (like {"query_string":{"query": "one"}})

The response._query field is filled out so the UI doesn't have to store the queries until the response is actually returned. The UI should extract the query string (such as "one" above) and use it to:

  • show the result text
  • link to /search?q=one

The example above also includes an aggregation to approximate the number of documents that are distinct (by hash). This number varies from query to query. The approximate value is in response.aggregations.approx_distinct_hash.value.
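Putting this together, a sketch of how a client might pull the useful fields out of one entry in responses (the returned shape is just an illustration):

// Sketch: summarize one entry from `responses` for display in the UI.
function summarizeResponse(response) {
  const queryString = response._query.query_string.query;
  return {
    query: queryString,
    total: response.hits ? response.hits.total : null,
    timedOut: Boolean(response.timed_out),
    distinctByHash: response.aggregations
      ? response.aggregations.approx_distinct_hash.value
      : null,
    searchUrl: '/search?q=' + encodeURIComponent(queryString),
  };
}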

If one of the queries fails, you won't get any of those fields set on the response. You will have to get the error message from response.error.root_cause[0].reason.

If response.error.root_cause is an empty list, fall back to response.error.failed_shards[0].reason.reason. If response.error.failed_shards is also an empty list, the Elasticsearch setup is utterly broken and all hope is lost.
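A sketch of that fallback chain:

// Sketch: pull an error message out of a failed response, following the
// fallback order described above.
function errorMessage(response) {
  if (!response.error) {
    return null;                      // the query succeeded
  }
  const rootCause = response.error.root_cause || [];
  if (rootCause.length > 0) {
    return rootCause[0].reason;
  }
  const failedShards = response.error.failed_shards || [];
  if (failedShards.length > 0) {
    return failedShards[0].reason.reason;
  }
  return 'Elasticsearch is broken; all hope is lost';
}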

A sample of the data returned by the request is below.

{
  "status": "ok",
  "responses": [
    {
      "aggregations": {
        "approx_distinct_hash": {
          "value": 4051
        }
      },
      "hits": {
        "max_score": 0,
        "total": 4034,
        "hits": []
      },
      "timed_out": false,
      "_query": {
        "query_string": {
          "query": "one"
        }
      },
      "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
      },
      "took": 31
    },
    {
      "aggregations": {
        "approx_distinct_hash": {
          "value": 2350
        }
      },
      "hits": {
        "max_score": 0,
        "total": 2350,
        "hits": []
      },
      "timed_out": false,
      "_query": {
        "query_string": {
          "query": "two"
        }
      },
      "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
      },
      "took": 40
    },
    {
      "aggregations": {
        "approx_distinct_hash": {
          "value": 1224
        }
      },
      "hits": {
        "max_score": 0,
        "total": 1224,
        "hits": []
      },
      "timed_out": false,
      "_query": {
        "query_string": {
          "query": "three"
        }
      },
      "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
      },
      "took": 39
    },
    {
      "aggregations": {
        "approx_distinct_hash": {
          "value": 21912
        }
      },
      "hits": {
        "max_score": 0,
        "total": 22185,
        "hits": []
      },
      "timed_out": false,
      "_query": {
        "query_string": {
          "query": "*"
        }
      },
      "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
      },
      "took": 61
    },
    {
      "error": {
        "failed_shards": [
          {
            "reason": {
              "col": 50,
              "index": "hoover-enron-pst",
              "caused_by": {
                "type": "parse_exception",
                "caused_by": {
                  "type": "token_mgr_error",
                  "reason": "Lexical error at line 1, column 5. Encountered: \"!\" (33), after : \"\""
                },
                "reason": "Cannot parse '!@$^!^#!@@': Lexical error at line 1, column 5. Encountered: \"!\" (33), after : \"\""
              },
              "line": 1,
              "reason": "Failed to parse query [!@$^!^#!@@]",
              "type": "query_parsing_exception"
            },
            "index": "hoover-enron-pst",
            "shard": 0,
            "node": "0xO11SNzT_6xdw7Y2mMA4w"
          },
          {
            "reason": {
              "col": 50,
              "index": "hoover-test-data",
              "caused_by": {
                "type": "parse_exception",
                "caused_by": {
                  "type": "token_mgr_error",
                  "reason": "Lexical error at line 1, column 5. Encountered: \"!\" (33), after : \"\""
                },
                "reason": "Cannot parse '!@$^!^#!@@': Lexical error at line 1, column 5. Encountered: \"!\" (33), after : \"\""
              },
              "line": 1,
              "reason": "Failed to parse query [!@$^!^#!@@]",
              "type": "query_parsing_exception"
            },
            "index": "hoover-test-data",
            "shard": 0,
            "node": "0xO11SNzT_6xdw7Y2mMA4w"
          }
        ],
        "reason": "all shards failed",
        "grouped": true,
        "phase": "query",
        "root_cause": [
          {
            "col": 50,
            "index": "hoover-enron-pst",
            "line": 1,
            "type": "query_parsing_exception",
            "reason": "Failed to parse query [!@$^!^#!@@]"
          },
          {
            "col": 50,
            "index": "hoover-test-data",
            "line": 1,
            "type": "query_parsing_exception",
            "reason": "Failed to parse query [!@$^!^#!@@]"
          }
        ],
        "type": "search_phase_execution_exception"
      },
      "_query": {
        "query_string": {
          "query": "!@$^!^#!@@"
        }
      }
    }
  ]
}