@dustinboswell
Last active November 6, 2020 17:58
Deduplicating a result set using shingleprints
seen_shingleprints = set()
final_results = []
for doc in search_results:
    if any(shingleprint in seen_shingleprints for shingleprint in doc.shingleprints):
        continue  # doc has at least 1 already-seen shingleprint, so skip it
    final_results.append(doc)
    seen_shingleprints.update(doc.shingleprints)
@Morriaty-The-Murderer

Morriaty-The-Murderer commented Nov 6, 2020

The continue only affects the inner for-loop; maybe it should be modified like this:

seen_shingleprints = set()
for doc in search_results:
    skipped = False
    for shingleprint in doc.shingleprints:
        if shingleprint in seen_shingleprints:
            # doc is a near-duplicate, skip it
            skipped = True
            break
    if not skipped:
        final_results.append(doc)
    seen_shingleprints.update(doc.shingleprints)

@dustinboswell
Author

Good catch - thanks for letting me know! I switched to using any(), which keeps the code more succinct.

@dustinboswell
Author

Also note that if the document is skipped, its shingleprints shouldn't be added to seen_shingleprints. (Your last line should be indented so that it sits inside the if not skipped: block.)
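
To make the point about skipped documents concrete, here is a minimal sketch comparing the two behaviours. The Doc class, the function names, and the integer shingleprints are hypothetical stand-ins for illustration; only the shingleprints attribute and the loop shape come from the gist.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    # Hypothetical stand-in for a search result; only `shingleprints`
    # mirrors the attribute used in the gist.
    name: str
    shingleprints: frozenset


def dedupe(search_results):
    """Record shingleprints only for documents that are kept."""
    seen_shingleprints = set()
    final_results = []
    for doc in search_results:
        if any(sp in seen_shingleprints for sp in doc.shingleprints):
            continue  # near-duplicate of a kept doc, so skip it
        final_results.append(doc)
        seen_shingleprints.update(doc.shingleprints)
    return final_results


def dedupe_always_update(search_results):
    """Variant that also records shingleprints of skipped documents."""
    seen_shingleprints = set()
    final_results = []
    for doc in search_results:
        skipped = any(sp in seen_shingleprints for sp in doc.shingleprints)
        if not skipped:
            final_results.append(doc)
        seen_shingleprints.update(doc.shingleprints)  # runs even when skipped
    return final_results


docs = [
    Doc("a", frozenset({1, 2})),
    Doc("b", frozenset({2, 3})),  # shares 2 with "a", so it is skipped
    Doc("c", frozenset({3, 4})),  # shares 3 only with the skipped "b"
]

kept = [d.name for d in dedupe(docs)]
kept_always = [d.name for d in dedupe_always_update(docs)]
print(kept)         # ['a', 'c']
print(kept_always)  # ['a']
```

In this toy input, "c" overlaps only with the skipped "b". Recording the shingleprints of skipped documents therefore drops "c" as well, even though it shares nothing with any document that was actually kept.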
