@magnusnissel
Last active July 25, 2022 08:07
Yule's K and Yule's I for lexical diversity in Python 3 (quick & dirty)
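import collections
import re


def tokenize(s):
    # Split on runs of characters that are not letters, digits, hyphens,
    # apostrophes, or underscores. re.split() leaves empty strings at the
    # edges of the input, so drop them before counting.
    tokens = re.split(r"[^0-9A-Za-z\-'_]+", s)
    return [tok for tok in tokens if tok]


def get_yules(s):
    """
    Returns a tuple with Yule's K and Yule's I.
    (cf. Oakes, M.P. 1998. Statistics for Corpus Linguistics.
    International Journal of Applied Linguistics, Vol 10 Issue 2)
    In production this needs exception handling.
    """
    tokens = tokenize(s)
    # Case-insensitive frequency of each token type.
    token_counter = collections.Counter(tok.upper() for tok in tokens)
    m1 = sum(token_counter.values())  # total number of tokens
    m2 = sum(freq ** 2 for freq in token_counter.values())  # sum of squared frequencies
    # Raises ZeroDivisionError when no token repeats (m2 == m1).
    i = (m1 * m1) / (m2 - m1)
    k = 10000 / i
    return (k, i)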
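The docstring notes that production use needs exception handling. The main failure case is a text in which no token repeats, where m2 equals m1 and the division for Yule's I fails. Below is a minimal sketch of one way to guard against that. The name `get_yules_safe` and the convention of returning `None` for an undefined measure are assumptions made for illustration, not part of the original gist.

```python
import collections
import re


def tokenize(s):
    # Same pattern as the gist; empty strings from re.split() are dropped.
    return [tok for tok in re.split(r"[^0-9A-Za-z\-'_]+", s) if tok]


def get_yules_safe(s):
    """Hypothetical guarded variant of get_yules(); returns (k, i)."""
    counter = collections.Counter(tok.upper() for tok in tokenize(s))
    m1 = sum(counter.values())                        # total tokens
    m2 = sum(freq ** 2 for freq in counter.values())  # sum of squared frequencies
    if m1 == 0:
        # Empty input: neither measure is defined (assumed convention).
        return (None, None)
    if m2 == m1:
        # No repeated types: K is 0 and I would divide by zero.
        return (0.0, None)
    i = (m1 * m1) / (m2 - m1)
    return (10000 / i, i)


print(get_yules_safe("the cat sat on the mat"))
print(get_yules_safe("every word here is unique"))
```

In the first call, "the" occurs twice among six tokens, so m1 = 6, m2 = 8, and I = 36 / 2 = 18. The second call has no repeated token and takes the guarded branch instead of raising.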
@KarimJedda

Also, k = 10000/i 👍

@magnusnissel
Author

Thanks for pointing it out, forgot to remove the self after excerpting it from a class.
