Reindexing Elasticsearch with Logstash 2.0
input {
  elasticsearch {
    hosts => [ "HOSTNAME_HERE" ]
    port => "9200"
    index => "INDEXNAME_HERE"
    size => 1000
    scroll => "5m"
    docinfo => true
    scan => true
  }
}

output {
  elasticsearch {
    hosts => [ "HOSTNAME_HERE" ]
    index => "%{[@metadata][_index]}"
    document_type => "%{[@metadata][_type]}"
    document_id => "%{[@metadata][_id]}"
  }
  stdout {
    codec => "dots"
  }
}
@geekpete commented May 28, 2015:
How fast does it go?

Is it still a serial process or can it be parallelised?

@markwalkom (owner) commented Jun 8, 2015:
You can't run it in parallel (i.e. multiple worker threads) as it uses scroll :(

@pickypg commented Jul 29, 2015:
You could use multiple instances and use a filter to split the scrolled index.
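A minimal sketch of that idea, assuming a `@timestamp` field to split on (the field, the date boundary, and the exact query syntax are assumptions; each instance takes a non-overlapping range and writes to the same output):

```
# Instance 1: only documents before a (hypothetical) midpoint date
input {
  elasticsearch {
    hosts   => [ "HOSTNAME_HERE" ]
    index   => "INDEXNAME_HERE"
    query   => '{ "query": { "range": { "@timestamp": { "lt": "2015-07-01" } } } }'
    docinfo => true
  }
}
# Instance 2 would use "gte": "2015-07-01" instead, covering the other half.
```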

@markwalkom (owner) commented Aug 7, 2015:
PRs welcome :p

@angma7 commented Aug 21, 2015:
It's failing for me; my query is like this:
query => '{"aggs":{"terms_keyword":{"terms":{"field":"user_keyword","size":0},"aggs":{"key_score":{"terms":{"field":"inquiry_id","size":0},"aggs":{"key_score":{"sum":{"field":"inquiry_score"}}}}}}}}'

It doesn't write anything to the destination index.

@danievanzyl commented Aug 25, 2015:
Add workers to the elasticsearch output:

output {
  elasticsearch {
    workers => 5
  }
}

@markwalkom (owner) commented Sep 2, 2015:
Just a note on this: changing the size in the input can be dangerous. As this is a scan/scroll, it grabs that number of documents from every shard in the index.

So if you change this to 100000 and you have 10 shards, that is 1,000,000 documents in flight at once, which will have an impact on heap use.

@markwalkom (owner) commented Nov 20, 2015:
FYI this has been updated for LS 2.0.

@PhaedrusTheGreek commented Jan 27, 2016:
Is it possible to reindex all indices in one shot, or do you have to specify each index name?
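Since the output above takes the index name from %{[@metadata][_index]}, a wildcard (or comma-separated list) in the input should in principle cover all matching indices in one run. A sketch, with the pattern as an assumption:

```
input {
  elasticsearch {
    hosts   => [ "HOSTNAME_HERE" ]
    index   => "logstash-*"   # hypothetical pattern matching every index to copy
    docinfo => true
  }
}
```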

@thanthos commented Feb 23, 2016:
Using this method, I keep getting "Error: Unable to establish loopback connection", and using the metrics filter to count the documents returned showed that the count does not correspond to the number of records in the original index. I have to keep filtering until I get the size right.

Also, FYI: if you set the scroll value too long and the plugin hits an exception, it will eat into available memory, because the search context is kept open while the plugin restarts. Use /_nodes/stats/indices/search?pretty to see the number of search contexts held open; you will find open_contexts increasing. I would like to hear how others handle this scenario.
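For reference, checking those open contexts with the stats endpoint mentioned above might look like this (hostname is a placeholder):

```
curl -s 'http://HOSTNAME_HERE:9200/_nodes/stats/indices/search?pretty'
# watch the "open_contexts" value across repeated calls;
# a steadily climbing number suggests leaked scroll contexts
```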

@blavoie commented Mar 2, 2016:
I have the same kind of config.

One notable difference is that I also use the metrics plugin to monitor throughput and the number of copied documents.
I also added some filters to randomly discard entries by percentage, which is useful for copying from production to test (smaller clusters). Or simply Ctrl+C if you don't need the complete timeframe sample.

For maximum parallelism, you can specify more than one non-overlapping elasticsearch input.

As noted before, be careful to tune the various input/worker/output parameters correctly and consistently:

  • LS_HEAP_SIZE
  • Number of workers (-w switch)
  • Number of elasticsearch input blocks (non-overlapping index patterns)
  • Elasticsearch output performance parameters (workers, flush_size, timeout)

There's no magic switch; it's always a matter of adjustment depending on your hardware and data volume.

Here's my file (sorry, it's documented in French):
https://gist.github.com/blavoie/58c90290935e8e1167e6

Also, before copying, we create a template (order 90) that disables refreshes and replicas for newly created indices. That makes bulk indexing faster. At the end of the bulk, we remove the template, re-enable the desired replication factor, and set the refresh interval back to a better value (say 15s).

Link to this template:
https://gist.github.com/blavoie/ebdf92793e14d8d9ebfc

In our case, for simplicity and multi-tenancy reasons, our indices all begin with ul-*.
This makes things simpler when copying stuff, applying templates, aliasing, etc.
Please note that this config is for LS < 2.2, as the workers/pools/pipeline model changed a bit as of LS >= 2.2.
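A minimal sketch of such a template (the template name and settings values here are assumptions; see the linked gist for the real one):

```
PUT _template/ul-reindex-speedup
{
  "order": 90,
  "template": "ul-*",
  "settings": {
    "number_of_replicas": 0,
    "refresh_interval": "-1"
  }
}
```

Setting refresh_interval to -1 disables refreshes entirely during the bulk load; both settings are restored afterwards.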

@markwalkom (owner) commented Mar 18, 2016:
Thanks for the comments @blavoie!

@JeremyColton commented Jul 25, 2016:
Hi, I use the default index naming "logstash-" for daily indices. I have changed the number of shards from the default 5 to 1, and now I need to re-index my indices. I don't want to re-index into a new index, e.g. "logstash-new-"; instead, I want the existing indices to end up spread across their single shard (instead of the current 5 shards per index).

How can I use this Logstash script to do that?

Or is there a better way, e.g. re-index into new indices ("logstash-new-"), delete the original "logstash-" indices, then re-index from the new "logstash-new-" indices back into "logstash-"?

Many thanks.

@geekpete commented May 4, 2017:
The Reindex API is a nice option:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html#_reindex_daily_indices

Also look into automatic scroll slicing, which allows a scroll to be processed by multiple threads in parallel for a nice speed boost.
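A sketch of that combination (index names are placeholders; the slices parameter requires ES 5.x or later):

```
POST _reindex?slices=5
{
  "source": { "index": "INDEXNAME_HERE" },
  "dest":   { "index": "INDEXNAME_HERE-v2" }
}
```

With slices=5, Elasticsearch splits the scroll into five slices and reindexes them in parallel, which sidesteps the serial-scroll limitation discussed earlier in this thread.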

@ksemaev commented Feb 22, 2019:
Can anybody please explain the scroll option? I'm reindexing with Logstash and it loops endlessly; data from the source index is randomly duplicated in the output.
