@boogheta
Created April 11, 2017 09:44
Computing a similarity ratio between 2 texts with difflib in Python.md
from difflib import SequenceMatcher
text1 = "Mais pourquoi la petite sirène est-elle aussi super, ce n'est pas comme les méchants poissons"
text2 = "Il était une fois une petite sirène super méchante qui mangeait des poissons"
matcher = SequenceMatcher(None, text1, text2)
blocks = matcher.get_matching_blocks()
for pos1, pos2, size in blocks:
    print(size, pos1, pos2, text1[pos1:pos1+size])
>>> 1 1 5 a
>>> 3 2 15 is 
>>> 1 7 18 u
>>> 15 16 21  petite sirène 
>>> 5 46 36 super
>>> 8 75 41  méchant
>>> 10 83 66 s poissons
>>> 0 93 76
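# Share of matched characters relative to the longer text. The last block is a
# sentinel (len(text1), len(text2), 0), so the max below is the longer length.
ratio_similarity = float(sum(m.size for m in blocks)) / max(blocks[-1].a, blocks[-1].b)
ratio_modif = 1 - ratio_similarity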
print(ratio_modif)
>>> 0.537634408602
# SequenceMatcher also provides methods that compute a ratio, but in our tests the formula above gave more convincing results:
print(1 - matcher.ratio())
>>> 0.491124260355
print(1 - matcher.quick_ratio())
>>> 0.207100591716
print(1 - matcher.real_quick_ratio())
>>> 0.100591715976
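
The two measures differ only in their denominator: the formula above divides the number of matched characters M by the length of the longer text, whereas `ratio()` returns 2*M divided by the combined length of both texts. A minimal sketch of both as reusable functions follows; the helper names `modification_ratio` and `difflib_modification_ratio` are hypothetical and not part of the original snippet, and the empty-input guard is an added assumption.

```python
from difflib import SequenceMatcher


def modification_ratio(text1, text2):
    """Return 1 - M / max(len(text1), len(text2)), M being the matched characters.

    Hypothetical helper wrapping the formula used in the snippet above.
    """
    matcher = SequenceMatcher(None, text1, text2)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    longest = max(len(text1), len(text2))
    if longest == 0:
        # Assumption: two empty texts are treated as identical.
        return 0.0
    return 1 - matched / longest


def difflib_modification_ratio(text1, text2):
    """Return 1 - ratio(), where ratio() is 2 * M / (len(text1) + len(text2))."""
    return 1 - SequenceMatcher(None, text1, text2).ratio()


if __name__ == "__main__":
    a = "Mais pourquoi la petite sirène est-elle aussi super, ce n'est pas comme les méchants poissons"
    b = "Il était une fois une petite sirène super méchante qui mangeait des poissons"
    print(modification_ratio(a, b))
    print(difflib_modification_ratio(a, b))
```

When the two texts have very different lengths, dividing by the longer length penalises the size gap more strongly than `ratio()` does, which may be why this formula behaved better in the tests mentioned above.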