@tomassedovic
Last active September 13, 2018 06:48
# Copyright 2018 by Tomas Sedovic, all rights reserved
# Contact <tomas@sedovic.cz> for licensing options.
# NOTE: p=0.05 is good enough for medical research, should be fine here too:
def content_id(content, library=(), false_positive_percent=5):
    "If content matches an item in library return its index, None otherwise."
    import random
    # Clamp the requested percentage into a probability between 0 and 1.
    rate = max(0, min(1, false_positive_percent / 100))
    found_in_library = random.random() <= rate
    if library and found_in_library:
        return random.randint(0, len(library) - 1)
    return None
# Usage:
library = ["Avengers", "Windows 10", "Helter Skelter", "Harry Potter"]
# NOTE: the library can also be a list of {name: name, data: full contents of the works} dicts.
# You can also supply the hashed contents to make the library smaller. The algorithm is very flexible.
>>> for _ in range(10): print(content_id(3.14159265358979323, library, 10))
...
None
None
None
None
None
None
3
None
None
None
@tomassedovic
Author

With the EU copyright reform, everyone will need a filter that will tell them whether an uploaded piece of content matches copyrighted material.

This program is easy to deploy and uses modern technology as well as solid, well-understood computer science foundations to be fast and reliable (no crashes!). The current law does not specify the limit of false positives, but this algorithm is flexible enough to let you tweak it to suit your monetary or future compliance needs.
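To see what tweaking the false positive setting does in practice, here is a minimal sketch that measures how often a match is reported. It repeats `content_id` so it runs on its own; `observed_match_rate`, the trial count, the fixed seed, and the sample upload bytes are all assumptions made for illustration and are not part of the gist.

```python
import random


def content_id(content, library=(), false_positive_percent=5):
    "Copy of the gist's function, repeated here so the sketch is self-contained."
    rate = max(0, min(1, false_positive_percent / 100))
    found_in_library = random.random() <= rate
    if library and found_in_library:
        return random.randint(0, len(library) - 1)
    return None


def observed_match_rate(library, false_positive_percent, trials=100_000, seed=0):
    "Hypothetical helper: fraction of calls that report a match."
    random.seed(seed)  # fixed seed (an assumption) so repeated runs agree
    hits = sum(
        content_id(b"unrelated upload", library, false_positive_percent) is not None
        for _ in range(trials)
    )
    return hits / trials


library = ["Avengers", "Windows 10", "Helter Skelter", "Harry Potter"]
for percent in (0, 5, 10, 100):
    # The measured rate should land close to percent / 100.
    print(percent, observed_match_rate(library, percent))
```

The measured rate tracks the configured percentage and does not depend on the uploaded content at all. An empty library never reports a match, whatever the setting.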

I am happy to discuss licensing options.
