@jthomerson
Created April 24, 2017 14:50
Delete all CloudSearch documents in a given domain
#!/bin/bash
# This script will delete *all* documents in a CloudSearch domain.
# USE WITH EXTREME CAUTION
# Note: depends on the AWS CLI being installed, as well as jq
# For jq, see: https://stedolan.github.io/jq/ and https://jqplay.org/

if [[ $# -ne 2 || $1 != "--doc-domain" || ! $2 =~ ^https://.*$ ]]; then
    echo "Must define --doc-domain argument (e.g. --doc-domain https://somedomain.aws.com)" >&2
    exit 1
fi

CS_DOMAIN=$2
TMP_DELETE_FILE=/tmp/delete-all-cloudsearch-documents.json
TMP_RESULTS_FILE=/tmp/delete-all-cloudsearch-documents-tmp-results.json

while true; do
    # Fetch up to 10,000 documents. JSON output is forced so that jq can parse
    # the results whatever default output format the AWS CLI is configured with.
    aws cloudsearchdomain search \
        --endpoint-url="$CS_DOMAIN" \
        --output json \
        --size=10000 \
        --query-parser=structured \
        --search-query="matchall" > "$TMP_RESULTS_FILE" || exit 1

    # Turn the returned document IDs into a batch of delete operations.
    jq '[.hits.hit[] | {type: "delete", id: .id}]' "$TMP_RESULTS_FILE" > "$TMP_DELETE_FILE"

    CNT_TOTAL=$(jq '.hits.found' "$TMP_RESULTS_FILE")
    CNT_DOCS=$(jq 'length' "$TMP_DELETE_FILE")

    if [[ $CNT_DOCS -gt 0 ]]; then
        echo "About to delete ${CNT_DOCS} documents of ${CNT_TOTAL} total in index"
        aws cloudsearchdomain upload-documents \
            --endpoint-url="$CS_DOMAIN" \
            --content-type='application/json' \
            --documents="$TMP_DELETE_FILE"
    else
        echo "No more docs to delete"
        exit 0
    fi
done
@jthomerson
Author

See http://stackoverflow.com/questions/17557295/how-to-clear-all-data-from-aws-cloudsearch for background, and alternatives in other languages.

@holmberd

holmberd commented May 4, 2017

Thanks, will give it a try.

@dmhendricks

This worked perfectly. Thank you and *hugs*

@winzig

winzig commented Apr 20, 2018

It worked, so thank you for that. But not sure I understand the output:

About to delete 340 documents of 340 total in index
{
    "status": "success",
    "adds": 0,
    "deletes": 340
}
About to delete 340 documents of 340 total in index
{
    "status": "success",
    "adds": 0,
    "deletes": 340
}
About to delete 340 documents of 340 total in index
{
    "status": "success",
    "adds": 0,
    "deletes": 340
}
About to delete 340 documents of 340 total in index
{
    "status": "success",
    "adds": 0,
    "deletes": 340
}

From a quick skim of the code, I was expecting that if the number was under 10,000 it would delete them all in one shot. Or, if it was working in batches, that the "of XXX total in index" figure would be higher than the number of documents deleted at a time?

@antonyakushin

@winzig were the documents actually deleted in your case? The loop in the code should have found 0 documents at the second iteration and stopped.

I found this thread while diagnosing aws cloudsearchdomain upload-documents, which for me reports successfully deleting 10000 documents but doesn't actually do so. This seems consistent with your output as the documents are not being deleted so the loop keeps going.
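
The situation described above (deletes reported as successful while the reported total never shrinks) leaves the script's loop running forever. A minimal sketch of one possible guard follows. It assumes the value of `.hits.found` from each search is passed in as an integer; the names `check_progress`, `MAX_STALLS`, `stalls`, and `prev_total` are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical guard: stop after MAX_STALLS consecutive iterations in which the
# total reported by the search (.hits.found) failed to decrease.
MAX_STALLS=3
stalls=0
prev_total=-1

# check_progress TOTAL
# Returns 0 while it is worth looping again, 1 once the total has stalled.
check_progress() {
    local total="$1"
    if [ "$prev_total" -ge 0 ] && [ "$total" -ge "$prev_total" ]; then
        stalls=$((stalls + 1))
    else
        stalls=0
    fi
    prev_total=$total
    [ "$stalls" -lt "$MAX_STALLS" ]
}
```

Inside the loop this could be called as `check_progress "$CNT_TOTAL" || { echo "Index is not shrinking, giving up" >&2; exit 1; }`. Allowing a few stalled iterations rather than one is a deliberate choice, since the reported total may lag behind the deletes for a while.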

@bizsimon

bizsimon commented Sep 9, 2020

worked like a charm. thank you so much 👍

@mafritsch

I tried the script (only the search part), but it did not work because the result file was not in JSON format. jq reported "Invalid numeric literal at line 1, column 5", and the file starts with:

HITS    8581    0
HIT     b187f653b61b08e5ee5f54c662b280e4ad368f5c1d631e32ce3b2cbf31c81ae4ba4b39360fc859fa364da32788549a5543fd4efb734f12438e3a4b4238bc5212
BOOK    Studio ASDoc
BOOST   0.029440219

So, how do I get CloudSearch to deliver a JSON file as the response?

@jthomerson
Author

@mafritsch did you try aws help and set the --output json option?

@mafritsch

@jthomerson Thank you very much. --output json was the missing option!
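
For readers who hit the same jq error, here is a small sketch of the fix discussed above: the search call with `--output json` passed explicitly, plus a cheap check that the results file starts like a JSON object before it is handed to jq. The function names `search_json` and `looks_like_json` are hypothetical helpers, not part of the original script.

```shell
#!/bin/bash
# search_json ENDPOINT
# Same search as the script, with JSON output forced so that the CLI's
# configured default format (text, table, ...) cannot break jq.
search_json() {
    aws cloudsearchdomain search \
        --endpoint-url="$1" \
        --output json \
        --size=10000 \
        --query-parser=structured \
        --search-query="matchall"
}

# looks_like_json FILE
# Returns 0 if the first non-whitespace character of FILE is '{'.
# This is only a sanity check, not a JSON validator.
looks_like_json() {
    local first
    first=$(tr -d '[:space:]' < "$1" | head -c 1)
    [ "$first" = "{" ]
}
```

A results file beginning with `HITS    8581    0`, as in the comment above, fails the check, which makes the cause easier to spot than a jq parse error.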

@rushimusmaximus

rushimusmaximus commented Feb 12, 2024

Thanks so much for making and publishing this script. Very helpful! In my case I added a filter query to target specific data.
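
The exact filter used above is not shown, so the following is only a sketch of how a filter query might be added to the script's search call using the CLI's `--filter-query` option. The function name `search_filtered` and the field name `category` in the usage note are assumptions for illustration.

```shell
#!/bin/bash
# search_filtered ENDPOINT FILTER
# The script's matchall search, narrowed by a structured filter expression.
# Quoting "$filter" keeps an expression containing spaces as one argument.
search_filtered() {
    local endpoint="$1" filter="$2"
    aws cloudsearchdomain search \
        --endpoint-url="$endpoint" \
        --output json \
        --size=10000 \
        --query-parser=structured \
        --search-query="matchall" \
        --filter-query="$filter"
}
```

For example, `search_filtered "$CS_DOMAIN" "(term field=category 'archived')"` would restrict the delete batch to documents whose hypothetical `category` field matches `archived`.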

I am deleting millions of records and am finding that it stops periodically, saying there are no more records to delete. I haven't researched the cause yet, but I suspect an "eventual consistency" issue: the search may be returning the same record IDs that were just deleted. I was wondering if paginating the results with a cursor might help.
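
The cursor idea raised above could look roughly like the sketch below. It relies on the search command's `--cursor` option (started with the value `initial`) and the `hits.cursor` field of the response. Whether this avoids the early stop is untested here, and the function names `fetch_page` and `list_all_ids` are hypothetical.

```shell
#!/bin/bash
# fetch_page ENDPOINT CURSOR
# Prints the next cursor on the first line, then one document ID per line.
fetch_page() {
    aws cloudsearchdomain search \
        --endpoint-url="$1" \
        --output json \
        --size=10000 \
        --query-parser=structured \
        --search-query="matchall" \
        --return=_no_fields \
        --cursor="$2" |
        jq -r '.hits.cursor, .hits.hit[].id'
}

# list_all_ids ENDPOINT
# Follows the cursor from page to page, printing every ID, until a page
# comes back with no IDs.
list_all_ids() {
    local endpoint="$1" cursor=initial page ids
    while true; do
        page=$(fetch_page "$endpoint" "$cursor") || return 1
        ids=$(printf '%s\n' "$page" | tail -n +2)
        [ -n "$ids" ] || break
        printf '%s\n' "$ids"
        cursor=$(printf '%s\n' "$page" | head -n 1)
    done
}
```

This only lists IDs. One option is to collect them first and build the delete batches afterwards, so that the cursor is not walking an index that is changing underneath it.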

@rushimusmaximus

I am deleting millions of records and am finding that it stops periodically, saying there are no more records to delete.

Results looked like:

    "hits": {
        "found": 9803889,
        "start": 0,
        "hit": [ ]

(no records listed in the hit list)

I found that adding a random sort really helped (--sort="_rand asc"). There may be a better solution, but this was an easy change that helped in my case.
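
A sketch of the change described above, together with one way to tell the "documents found but empty hit list" state apart from a truly empty index, so the script can retry instead of exiting. The `--sort="_rand asc"` value is taken from the comment as reported; the function names `search_random_order` and `page_state` are hypothetical.

```shell
#!/bin/bash
# search_random_order ENDPOINT
# The script's search with the random sort reported to help in the comment.
search_random_order() {
    aws cloudsearchdomain search \
        --endpoint-url="$1" \
        --output json \
        --size=10000 \
        --query-parser=structured \
        --search-query="matchall" \
        --sort="_rand asc"
}

# page_state FOUND HIT_COUNT
# Prints "delete" when there are hits to remove, "retry" when the index still
# reports documents but the hit list came back empty, and "done" otherwise.
page_state() {
    local found="$1" hits="$2"
    if [ "$hits" -gt 0 ]; then
        echo delete
    elif [ "$found" -gt 0 ]; then
        echo retry
    else
        echo done
    fi
}
```

With the numbers shown above, `page_state 9803889 0` prints `retry`, whereas the original script would print "No more docs to delete" and exit.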
