@markwalkom
Last active April 29, 2022 10:23
Reindexing Elasticsearch with Logstash 2.0
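input {
  elasticsearch {
    hosts => [ "HOSTNAME_HERE" ]
    port => "9200"
    index => "INDEXNAME_HERE"
    # Documents fetched per shard on each scroll request
    size => 1000
    # How long Elasticsearch keeps the scroll context alive between requests
    scroll => "5m"
    # Copy _index, _type and _id into @metadata so the output can reuse them
    docinfo => true
    scan => true
  }
}
output {
  elasticsearch {
    hosts => [ "HOSTNAME_HERE" ]
    # Write each document back under its original index, type and id
    index => "%{[@metadata][_index]}"
    document_type => "%{[@metadata][_type]}"
    document_id => "%{[@metadata][_id]}"
  }
  # Print one dot per event as a progress indicator
  stdout {
    codec => "dots"
  }
}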
@geekpete

How fast does it go?

Is it still a serial process or can it be parallelised?

@markwalkom
Author

You can't run it in parallel (i.e. with multiple worker threads), as it uses scroll :(

@pickypg

pickypg commented Jul 29, 2015

You could use multiple instances and use a filter to split the scrolled index.

@markwalkom
Author

PRs welcome :p

@angma7

angma7 commented Aug 21, 2015

This is failing for me. My query looks like this:
query => '{"aggs":{"terms_keyword":{"terms":{"field":"user_keyword","size":0},"aggs":{"key_score":{"terms":{"field":"inquiry_id","size":0},"aggs":{"key_score":{"sum":{"field":"inquiry_score"}}}}}}}}'

Nothing gets written to the destination index.

@danievanzyl

Add workers to the elasticsearch output:
output {
  elasticsearch {
    workers => 5
  }
}

@markwalkom
Author

A note on this: changing the size in the input can be dangerous. Because this is a scan/scroll, it fetches that number of documents from every shard in the index.

So if you change this to 100000 and you have 10 shards, that is 1000000 documents per scroll request, which will have an impact on heap use.
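
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and the 5-shard figure in the first call is an assumed shard count, not something stated in the gist.

```python
def docs_per_scroll_batch(size: int, shards: int) -> int:
    """With scan/scroll, `size` applies per shard, so one batch is size * shards."""
    return size * shards


# The gist's size of 1000 on an index with an assumed 5 shards.
print(docs_per_scroll_batch(1000, 5))      # 5000

# The example from the comment above: size 100000 on 10 shards.
print(docs_per_scroll_batch(100000, 10))   # 1000000
```

Multiplying the result by your average document size gives a rough idea of how much data each batch pulls into memory.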

@markwalkom
Author

FYI this has been updated for LS 2.0.

@PhaedrusTheGreek

Is it possible to reindex all indices in one shot, or do you have to specify each index name?

@thanthos

Using this method, I keep getting "Error: Unable to establish loopback connection". Counting the returned documents with the metrics plugin also showed that the total does not match the number of records in the original index. I have to keep filtering until I get the size right.

Also, FYI: if you set the scroll value too long and the plugin hits an exception, it will eat into available memory, because the search context is kept open while the plugin restarts. Use /_nodes/stats/indices/search?pretty to see the number of open search contexts; you will find open_contexts increasing. I would like to hear how others are handling this scenario.
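
As a rough sketch of how the open-context count mentioned above could be tracked, the following Python sums `open_contexts` across all nodes in an already-parsed response from `/_nodes/stats/indices/search`. The function name and the sample response are hypothetical; the sample keeps only the fields the function reads.

```python
from typing import Any, Dict


def total_open_contexts(nodes_stats: Dict[str, Any]) -> int:
    """Sum indices.search.open_contexts over every node in the stats response."""
    total = 0
    for node in nodes_stats.get("nodes", {}).values():
        total += node.get("indices", {}).get("search", {}).get("open_contexts", 0)
    return total


# Hypothetical, trimmed-down response with two nodes.
sample = {
    "nodes": {
        "node-a": {"indices": {"search": {"open_contexts": 3}}},
        "node-b": {"indices": {"search": {"open_contexts": 2}}},
    }
}
print(total_open_contexts(sample))  # 5
```

Polling this value while the reindex runs should show whether contexts pile up after a plugin restart.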

@blavoie

blavoie commented Mar 2, 2016

I have the same kind of config.

One notable difference is that I also use the metrics plugin to monitor throughput and the number of copied documents.
I also added some filters that randomly discard a percentage of entries, which is useful when copying from production to test (smaller clusters). Or simply press CTRL + C if you don't need a sample covering the complete timeframe.

For maximum parallelism, you can specify more than one elasticsearch input, as long as the inputs do not overlap.

As noted before, take care to tune the different input/worker/output parameters correctly and consistently:

  • LS_HEAP_SIZE
  • Number of workers (-w switch)
  • Number of elasticsearch input blocks (non-overlapping index patterns)
  • Elasticsearch output performance parameters (workers, flush_size, timeout)

There's no magic switch, it's always a matter of adjustment depending on your hardware and data volume.
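
To illustrate the non-overlapping inputs mentioned above, here is a small Python sketch that splits a list of index names into disjoint groups and prints one elasticsearch input block per group, reusing the options from the gist. The helper names and the `ul-2016.03.*` index names are hypothetical, and the sketch assumes the input's index option accepts a comma-separated list in the same way Elasticsearch multi-index syntax does.

```python
from typing import List


def partition(indices: List[str], n_inputs: int) -> List[List[str]]:
    """Split index names round-robin into at most n_inputs disjoint groups."""
    if n_inputs < 1:
        raise ValueError("n_inputs must be at least 1")
    groups: List[List[str]] = [[] for _ in range(n_inputs)]
    for i, name in enumerate(sorted(indices)):
        groups[i % n_inputs].append(name)
    # Drop empty groups when there are fewer indices than inputs.
    return [g for g in groups if g]


def render_input(host: str, indices: List[str]) -> str:
    """Render one elasticsearch input block using the options from the gist."""
    index_list = ",".join(indices)
    return "\n".join([
        "  elasticsearch {",
        f'    hosts => [ "{host}" ]',
        f'    index => "{index_list}"',
        "    size => 1000",
        '    scroll => "5m"',
        "    docinfo => true",
        "    scan => true",
        "  }",
    ])


# Hypothetical daily indices, split across two inputs.
groups = partition(["ul-2016.03.01", "ul-2016.03.02", "ul-2016.03.03"], 2)
print("input {")
for group in groups:
    print(render_input("HOSTNAME_HERE", group))
print("}")
```

Because every index name lands in exactly one group, no document is read by two inputs.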

Here's my file (sorry, the comments are in French):
https://gist.github.com/blavoie/58c90290935e8e1167e6

Also, before copying, we create an end-of-line template (order 90, so it overrides lower-order templates) that disables refreshes and replicas for newly created indices. That makes bulk indexing faster. At the end of the bulk load, we remove the template, re-enable the desired replication factor, and set the refresh interval back to a better value (say 15s).

Link to this template:
https://gist.github.com/blavoie/ebdf92793e14d8d9ebfc
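
For readers who do not follow the link, a template along the lines described above might look roughly like the following, written as Python dictionaries that print as JSON request bodies. This is a sketch, not a copy of the linked gist: it uses the legacy `template` key for the index pattern, and the replica count restored at the end is an assumed value.

```python
import json

# Body for a legacy index template (PUT _template/<name>): applied to new
# indices matching ul-*, with refresh and replicas switched off for bulk loading.
bulk_template = {
    "template": "ul-*",
    "order": 90,
    "settings": {
        "index": {
            "refresh_interval": "-1",
            "number_of_replicas": 0,
        }
    },
}

# Body for PUT ul-*/_settings once the copy has finished.
# number_of_replicas = 1 is an assumption; use your own replication factor.
restore_settings = {
    "index": {
        "refresh_interval": "15s",
        "number_of_replicas": 1,
    }
}

print(json.dumps(bulk_template, indent=2))
print(json.dumps(restore_settings, indent=2))
```

Remember to delete the template after the copy, otherwise every new ul-* index will keep being created without replicas.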

In our case, for simplicity and multi-tenancy reasons, all our indices begin with ul-.
This makes things simpler when copying data, applying templates, aliasing, etc.
Please note that this config is for LS < 2.2, as the workers/pools/pipeline model changed a bit as of LS >= 2.2.

@markwalkom
Author

Thanks for the comments @blavoie!

@JeremyColton

JeremyColton commented Jul 25, 2016

Hi, I use the default index naming "logstash-" for a daily index. I have changed the number of shards from the default 5 to 1, and I need to re-index my existing indices. I don't want to re-index into a new index, e.g. "logstash-new-"; instead, I want each existing index to end up with a single shard (instead of the current 5 shards per index).

How can I use this logstash script to do this?

Is there a better way to do this, e.g. re-index into new "logstash-new-" indices, delete the original "logstash-" indices, then re-index from the "logstash-new-" indices back into "logstash-"?

Many thanks.

@geekpete

geekpete commented May 4, 2017

The Reindex API is a nice option:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html#_reindex_daily_indices

Also look into automatic scroll slicing, which allows scrolls to be processed by multiple threads in parallel, giving a nice speed boost.
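
As a minimal sketch of the two options above, the Python below builds a request body for the Reindex API and a set of manually sliced scroll bodies, one per parallel worker. The index names and the function name are hypothetical, and whether these features are available depends on your Elasticsearch version. Automatic slicing is requested with the `slices` URL parameter on `_reindex` rather than in the body.

```python
import json
from typing import Any, Dict, List

# Body for POST _reindex (add ?slices=<n> to the URL for automatic slicing).
# Both index names are hypothetical.
reindex_body = {
    "source": {"index": "logstash-2016.07.25"},
    "dest": {"index": "logstash-new-2016.07.25"},
}


def sliced_scroll_bodies(n_slices: int, size: int = 1000) -> List[Dict[str, Any]]:
    """Build one scroll search body per slice; each slice sees a disjoint subset."""
    if n_slices < 2:
        raise ValueError("sliced scroll needs at least 2 slices")
    return [
        {
            "slice": {"id": i, "max": n_slices},
            "size": size,
            "query": {"match_all": {}},
        }
        for i in range(n_slices)
    ]


print(json.dumps(reindex_body, indent=2))
for body in sliced_scroll_bodies(2):
    print(json.dumps(body))
```

Each sliced body would be sent as its own scroll search by a separate worker, which is what makes the parallel processing possible.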

@ksemaev

ksemaev commented Feb 22, 2019

Can anybody please explain the scroll option? I am reindexing with Logstash and it loops endlessly: data from the source index is randomly duplicated in the output.
