```
input {
  elasticsearch {
    hosts => [ "HOSTNAME_HERE" ]
    port => "9200"
    index => "INDEXNAME_HERE"
    size => 1000
    scroll => "5m"
    docinfo => true
    scan => true
  }
}
output {
  elasticsearch {
    hosts => [ "HOSTNAME_HERE" ]
    index => "%{[@metadata][_index]}"
    document_type => "%{[@metadata][_type]}"
    document_id => "%{[@metadata][_id]}"
  }
  stdout {
    codec => "dots"
  }
}
```
geekpete commented May 28, 2015:

How fast does it go? Is it still a serial process or can it be parallelised?
You can't parallelise it (i.e. multiple worker threads) as it uses scroll :(
pickypg commented Jul 29, 2015:

You could use multiple instances and use a filter to split the scrolled index.
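One way to sketch that split (a hypothetical example: each Logstash instance filters on a non-overlapping range of an assumed numeric field, here `some_numeric_field`; depending on plugin version the query may need a top-level `"query"` wrapper):

```
# instance 1 of 2: takes the lower half of the keyspace;
# a second instance would use "gte" instead of "lt"
input {
  elasticsearch {
    hosts => [ "HOSTNAME_HERE" ]
    index => "INDEXNAME_HERE"
    query => '{ "range": { "some_numeric_field": { "lt": 5000000 } } }'
    scroll => "5m"
    docinfo => true
  }
}
```

Each instance still runs a serial scroll, but the two scrolls proceed in parallel against disjoint document sets.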
PRs welcome :p
angma7 commented Aug 21, 2015:

I'm failing; my query is like this:

```
query => '{"aggs":{"terms_keyword":{"terms":{"field":"user_keyword","size":0},"aggs":{"key_score":{"terms":{"field":"inquiry_id","size":0},"aggs":{"key_score":{"sum":{"field":"inquiry_score"}}}}}}}}'
```

It doesn't write anything into the destination index.
danievanzyl commented Aug 25, 2015:

add to
markwalkom commented Sep 2, 2015:

Just a note on this: changing the size in the input can be dangerous. As this is a scan/scroll, it grabs that number of documents from every shard in the index. So if you change it to 100000 and you have 10 shards, that is 1000000 documents per scroll page, which will have an impact on heap use.
FYI this has been updated for LS 2.0.
PhaedrusTheGreek commented Jan 27, 2016:

Is it possible to reindex all indices in one shot, or do you have to specify each index name?
thanthos commented Feb 23, 2016:

Using this method I keep getting "Error: Unable to establish loopback connection", and using the metrics filter to count the documents returned showed that the count does not correspond to the number of records in the original index; I had to keep filtering until I got the size right. Also, FYI: if you set the scroll value too long and the plugin hits an exception, it will eat into available memory, as the search context is kept open while the plugin restarts. Use /_nodes/stats/indices/search?pretty to see the number of search contexts held open; you will find open_contexts increasing. I'd like to hear how others are handling this scenario.
blavoie commented Mar 2, 2016:

I have the same kind of config. One notable difference is that I also use the metrics plugin to monitor throughput and the number of copied documents.
I also added some filters to randomly discard entries by percentage, which is useful when copying from production to test (smaller clusters). Or simply Ctrl+C if you don't need the complete timeframe sample.
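The percentage-based discard can be sketched with Logstash's drop filter (the percentage here is just an example value):

```
filter {
  # randomly drop ~90% of events, keeping ~10% for the smaller test cluster
  drop { percentage => 90 }
}
```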
For maximum parallelism, you can specify more than one non-overlapping elasticsearch input.
As noted before, be careful to tune the various input/worker/output parameters correctly and consistently:
- LS_HEAP_SIZE
- number of workers (the -w switch)
- number of elasticsearch input blocks (non-overlapping index patterns)
- elasticsearch output performance parameters (workers, flush_size, timeout)
There's no magic switch; it's always a matter of adjustment depending on your hardware and data volume.
Here's my file (sorry, it's documented in French):
https://gist.github.com/blavoie/58c90290935e8e1167e6
Also, before copying, we create a template (order 90) that disables refreshes and replicas for newly created indices, which makes bulk indexing faster. At the end of the bulk load we remove the template, re-enable the desired replication factor, and set the refresh interval back to a better value (say 15s).
Link to this template:
https://gist.github.com/blavoie/ebdf92793e14d8d9ebfc
In our case, for simplicity and multi-tenancy reasons, our indices all begin with ul-*. This makes things simpler when copying, applying templates, aliasing, etc.
Please note that this config is for LS < 2.2, as the workers/pools/pipeline model changed a bit as of LS >= 2.2.
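A template like the one described might look roughly like this (a sketch using the legacy `_template` API of that era; the `ul-*` pattern and order 90 come from the comment, while the template name and exact settings layout are assumptions):

```
PUT _template/reindex_speedup
{
  "order": 90,
  "template": "ul-*",
  "settings": {
    "index": {
      "refresh_interval": "-1",
      "number_of_replicas": 0
    }
  }
}
```

Once the bulk load finishes, delete the template and restore the desired `number_of_replicas` and a sane `refresh_interval` (e.g. 15s) on the copied indices.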
Thanks for the comments @blavoie!
JeremyColton commented Jul 25, 2016:

Hi, I use the default index naming "logstash-" for daily indices. I have changed the number of shards from the default 5 to 1, and I need to re-index my existing indices. I don't want to re-index into new indices, e.g. "logstash-new-"; instead I want each existing index to end up on its single shard (rather than the current 5 shards per index).
How can I use this Logstash script to do that?
Is there a better way, e.g. re-index into new indices ("logstash-new-"), delete the original "logstash-" indices, then re-index back into "logstash-" from the new "logstash-new-" indices?
Many thanks.
geekpete commented May 4, 2017:

The Reindex API is a nice option:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html#_reindex_daily_indices
Also look into automatic scroll slicing, which allows scrolls to be processed by multiple threads in parallel for a nice speed boost.
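For reference, a sliced reindex can be sketched like this (index names are placeholders and the slice count is an example; `wait_for_completion=false` runs it as a background task):

```
POST _reindex?slices=5&wait_for_completion=false
{
  "source": { "index": "INDEXNAME_HERE" },
  "dest":   { "index": "NEW_INDEXNAME_HERE" }
}
```

Each slice is processed by its own scroll, giving the parallelism that the single Logstash scroll above lacks.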