@polyfractal
Last active July 7, 2021 11:14

General

  • G1GC is still buggy and eats more CPU, so we don't really recommend it. For example, bugs like this are still being found (which is pretty scary when you think about it...who knows what that's doing under the covers to data structures): https://bugs.openjdk.java.net/browse/JDK-8148175 That's not to say G1 won't be a perfect match some day...but we don't think it's ready yet.

  • Since you are on 2.0+ and can leverage doc values, I would adjust your heap: set it to 4gb and give the rest to the OS for file caching. Doc values are off-heap and Lucene relies on the OS to cache them in the FS cache, so the more memory you can give the OS the better.

  • Similarly, don't bother setting the field data cache...it isn't used anymore (doc values replace field data)

  • I'd bump the flush_size of Logstash to a number much larger than 500. Since each of these rows is pretty small, 500 documents won't physically be very large. Bumping that up allows ES to amortize the cost of fsyncs over more documents, etc. It will probably require some fiddling, but if we assume each doc is 1.6kb (50 fields * 32 bytes each), I'd start somewhere around 20,000, which would give you bulks of ~32mb (see the sketch after this list).

  • I would not call _optimize during live indexing. Optimize forces Lucene to perform unplanned segment merges, which is very expensive and antagonistic to the merge process while you are still indexing. It does "clean up" some space because it merges down to fewer segments, but it does that at the cost of a very expensive streaming merge-sort. And it's ultimately rather moot, because the normal indexing process continuously merges small segments into larger segments, so the _optimize process happens naturally. I know why you did it (because space was tight), but hopefully the recommendations here will give you enough breathing room to avoid _optimize, which should help indexing perf a lot.
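
As promised above, here's a sketch of what the flush_size change could look like in the Logstash pipeline config. This assumes the 2.x-era elasticsearch output plugin (which exposes flush_size); the host and index pattern are placeholders to adapt to your setup. (On 2.x the heap, similarly, can be set with the ES_HEAP_SIZE environment variable, e.g. ES_HEAP_SIZE=4g.)

output {
  elasticsearch {
    hosts      => ["localhost:9200"]
    index      => "logstash-%{+YYYY.MM.dd}"
    flush_size => 20000   # ~32mb bulks at ~1.6kb/doc; tune from here
  }
}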

Mappings

So the goal with the mappings is to remove the "full-text" aspects that Elasticsearch/Lucene normally include. Those features add some overhead to disk usage, since ES also creates inverted indices for search. Since the other systems don't have anything like this, it should be safe to turn them off for a closer comparison.

Now, in general, we don't recommend disabling the full-text features: part of the power of ES is leveraging both search and analytics in the same system. But since you are cramped for space, it makes sense to disable them in the test.

Second note: I'm not sure if that ES-SQL adapter needs any of the full-text components...I don't know how it works under the covers, or how it converts SQL to ES DSL. I wouldn't think it needs any of the full-text search components, but I'm not positive. It shouldn't be needed for analytics over ES DSL anyhow :)

curl -XPUT localhost:9200/_template/pure_analytics -d '
{
  "template" : "logstash-*",
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas": 0,
    "index.translog.flush_threshold_size": "1g",
    "index.refresh_interval": -1
  },
  "mappings" : {
    "_default_" : {
      "_all" : {"enabled" : false},
      "_source": { "enabled": false },
      "dynamic_templates" : [
        {
          "doubles": {
            "match_mapping_type": "double",
            "mapping": {
              "type": "float"
            }
          }
        },
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "no",
              "doc_values": true
            }
          }
        }
      ],
      "properties" : {
        "@timestamp": { "type": "date" },
        "@version": { "type": "string", "index": "not_analyzed" },
        "pickup_datetime": { "type": "date" },
        "dropoff_datetime": { "type": "date" },
        "pickup_datetime": { "type": "date" }
      }
    }
  }
}'
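
Once the template is in place, you can sanity-check what will be applied to new indices via the standard template API (the ?pretty flag just formats the response):

curl -XGET localhost:9200/_template/pure_analytics?pretty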

So what's going on here?

  • We set up an index template that is automatically applied to any new index matching the logstash-* pattern. You can modify this to suit your test.

  • Use a single primary shard, zero replicas, and disable the refresh interval. Basically what you had in your test, optimizing for the single-node setup.

  • translog.flush_threshold_size defaults to 512mb, we're bumping it to 1g. This allows ES to accumulate a larger translog before flushing it to segments (assuming Lucene doesn't do it automatically). This can sometimes help with the bulk-import scenario, where you are just trying to get as much data in as fast as possible. It has the downside of slowing node restarts, since it may have to replay the entire translog (but that's not a concern here)

  • We disable the _all field. This is a special field that concatenates all the other fields, used as a "catch-all" search field when you don't know where to search. It adds a lot to index size, since it's literally a second copy of the same data in a different field.

  • We disable the _source field. This is a copy of the original JSON that was sent to ES, often used to display search hits back to the user. If you are doing pure analytics, there's no need to save it. NOTE: we don't generally recommend disabling this in "real" systems, since it is very useful to have the original source. But, again, we're optimizing disk usage for the test :)

  • Two dynamic templates.

    • The first makes sure all doubles are stored as floats. Floats and doubles are stored at their full byte-width in the column store (known as doc values), because floats/doubles cannot easily be compressed for random access (unlike ints/longs which can). So we make sure to store as float to minimize disk usage. This is the default starting in 2.3+

    • The second makes sure strings are not indexed for full-text search (index: no disables the inverted index, so they are no longer "searchable"), but keeps them available for analytics by setting doc_values: true. Note: this is a pretty extreme approach, and will save quite a bit of disk and speed up indexing, but again we don't really recommend it for "real" systems. A more common approach is index: not_analyzed, which keeps the fields "searchable" but not analyzed for full-text...e.g. you can filter/search on exact matches for enums (see the sketch after this list).

  • Then a few fields are explicitly mapped: the Logstash timestamp field and the two dates in the data. You could also add the various geo points if you wanted, but I didn't bother for now (dynamic mapping will pick them up as doubles, which the template above downcasts to floats).
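
As mentioned above, here's a sketch of that more common, less extreme variant of the string dynamic template: the same structure as in the template, just swapping index: no for index: not_analyzed so the fields stay filterable as exact terms (on 2.x, doc_values is on by default for not_analyzed strings, but it's spelled out here for clarity):

{
  "strings": {
    "match_mapping_type": "string",
    "mapping": {
      "type": "string",
      "index": "not_analyzed",
      "doc_values": true
    }
  }
}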

@yehosef commented May 22, 2016

It seems you could also set "store": false for all the fields - as long as you're not querying the fields directly. E.g., if you are only getting results from the aggregations, you don't need the stored fields and they will just take up more space.
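
For illustration, that suggestion would look something like this in the string dynamic template (note that store defaults to false in ES mappings, so this makes the default explicit rather than changing behavior; with _source disabled as above, field values are then only reachable through doc values, i.e. aggregations):

{
  "strings": {
    "match_mapping_type": "string",
    "mapping": {
      "type": "string",
      "index": "no",
      "store": false,
      "doc_values": true
    }
  }
}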
