@HoffmannP
Last active February 22, 2024 09:58
#!/usr/bin/env python
import re

# Also skip the first word of each verse, which is usually capitalized.
IGNORE_BEGINNING = True

allUppercase = re.compile(r'[A-ZÄÜÖ][A-ZÄÜÖ]+')
matchLetters = re.compile(r'[a-zäüößA-ZÄÜÖ]')
matchUCLetters = re.compile(r'[A-ZÄÜÖ]')

# Source https://www.biblesupersearch.com/bible-downloads/
for name in ['kjv.txt', 'luther.txt']:
    letters = 0
    ucletters = 0
    with open(name) as f:
        for line in f:
            words = line.split(' ')
            # Skip short and empty lines.
            if len(words) < 4:
                continue
            # Skip the verse reference at the start of the line.
            first_word = 2
            if ':' in words[first_word]:
                first_word += 1
            if words[first_word] == '¶':
                first_word += 1
            if IGNORE_BEGINNING:
                first_word += 1
            # Drop words that start with two or more uppercase letters (all caps).
            sentence = ' '.join(
                filter(
                    lambda word: allUppercase.match(word) is None,
                    words[first_word:]))
            letters += len(matchLetters.findall(sentence))
            ucletters += len(matchUCLetters.findall(sentence))
    book_name = name.split(".")[0]
    percentage = 100 * ucletters / letters
    # Output: name, uppercase share, (total letters/uppercase letters)
    print(f'{book_name}:\t{percentage:.1f}%\t({letters}/{ucletters})')
@HoffmannP

Added IGNORE_BEGINNING: set it to True to ignore short and empty lines, which probably contain more uppercase words, and the beginning of each line, which usually starts with a number and an uppercase word.
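
To illustrate the prefix skipping, here is a minimal sketch of that step as a standalone function. The helper name `verse_words` and the sample lines are hypothetical; it assumes each line looks like "Book chapter:verse text", which may not match every download format.

```python
def verse_words(line, ignore_beginning=True):
    """Return the words of a verse after its reference prefix ([] for short lines)."""
    words = line.split(' ')
    if len(words) < 4:
        return []  # short or empty line: nothing to count
    first_word = 2  # assumed layout: book name, then chapter:verse
    if ':' in words[first_word]:
        first_word += 1  # two-word book prefix such as "1 Samuel"
    if words[first_word] == '¶':
        first_word += 1  # paragraph marker
    if ignore_beginning:
        first_word += 1  # first word of the verse, usually capitalized
    return words[first_word:]


# Hypothetical sample lines, not taken from the downloaded files
print(verse_words('Genesis 1:1 In the beginning God'))
print(verse_words('1 Samuel 1:1 Now there was', ignore_beginning=False))
```

With `ignore_beginning=True` the first sample keeps only "the beginning God", so the capitalized "In" never reaches the counters.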

@HoffmannP

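Results (per file: total letters, uppercase letters, uppercase percentage):

IGNORE_BEGINNING = False
kjv.txt: 2087073 letters, 72887 uppercase, 3.4923071689394667 %
luther.txt: 3150777 letters, 221274 uppercase, 7.022839128253126 %

IGNORE_BEGINNING = True
kjv.txt: 1808897 letters, 50564 uppercase, 2.795294591123762 %
luther.txt: 2684794 letters, 172202 uppercase, 6.4139744054851136 %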
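
The percentages follow directly from the two counts. As a small sketch, the hypothetical helper `uppercase_percentage` below recomputes them from the IGNORE_BEGINNING = True counts above and adds a guard against an empty file, which the script itself does not have.

```python
def uppercase_percentage(uppercase, letters):
    """Return the share of uppercase letters in percent (0.0 if there are no letters)."""
    if letters == 0:
        return 0.0  # avoid ZeroDivisionError on an empty or fully skipped file
    return 100 * uppercase / letters


# Counts reported for IGNORE_BEGINNING = True
for name, letters, uppercase in [('kjv.txt', 1808897, 50564),
                                 ('luther.txt', 2684794, 172202)]:
    print(f'{name}: {uppercase_percentage(uppercase, letters):.2f}%')
```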


HoffmannP commented Feb 22, 2024

Changed the code to only count uppercase letters if they appear at the start of a word (all-caps words are not counted), using IGNORE_BEGINNING = True.

kjv.txt: 2087073 letters, 44015 uppercase, 2.108934378433337 %
luther.txt: 3150777 letters, 170745 uppercase, 5.41913946940707 %
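
One possible reading of "only at the start of a word" is sketched below. This is an assumption, not the gist's exact code: the script drops all-caps words and then counts every remaining uppercase letter, whereas this sketch counts at most one uppercase letter per word. The function name `count_letters` and the sample words are hypothetical.

```python
import re

ALL_CAPS = re.compile(r'[A-ZÄÜÖ][A-ZÄÜÖ]+')  # word starts with two or more capitals
LETTER = re.compile(r'[a-zäüößA-ZÄÜÖ]')
UPPER = re.compile(r'[A-ZÄÜÖ]')


def count_letters(words):
    """Return (total letters, words starting with an uppercase letter), skipping all-caps words."""
    kept = [w for w in words if ALL_CAPS.match(w) is None]
    letters = sum(len(LETTER.findall(w)) for w in kept)
    # re.match anchors at the start, so only a word-initial capital is counted
    initial_upper = sum(1 for w in kept if UPPER.match(w))
    return letters, initial_upper


# "LORD" is dropped as all caps; only "God" starts with a capital
print(count_letters(['the', 'LORD', 'God', 'said']))
# German capitalizes nouns, so more words start with a capital
print(count_letters(['Und', 'Gott', 'sprach:', 'Es', 'werde', 'Licht']))
```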


HoffmannP commented Feb 22, 2024

Updated the line cleaning again and switched to another download (the machine-readable version instead of the plain-text one, which has hard line breaks):

kjv:	1.9%	(3090749/57577)
luther:	5.2%	(2993877/156041)
