@jordanmessina
Created November 29, 2012 18:18
When given a blog RSS feed, count_the_Is.py reports, for each post, the number of first-person pronouns as a fraction of the total word count.
#!/usr/bin/env python3
import sys

import feedparser
from pyquery import PyQuery as pq

# Pronouns are padded with spaces so that only whole words match; a pronoun
# next to punctuation or at the very start or end of the text is not counted.
FIRST_PERSON_PRONOUNS = [
    ' I ', " I'd ", " I'm ",
    ' me ', ' my ', ' mine ', ' myself ', ' Me ', ' My ', ' Mine ', ' Myself ',
]


def occurrences(string, sub):
    """Count the occurrences of a substring in a string, allowing overlaps."""
    count = start = 0
    while True:
        # find() returns -1 when there is no further match, so start becomes 0.
        start = string.find(sub, start) + 1
        if start > 0:
            count += 1
        else:
            return count


def main():
    if len(sys.argv) != 2:
        sys.exit("Wrong number of arguments, please supply the RSS feed URL as a command line argument")
    feed = feedparser.parse(sys.argv[1])
    for item in feed['items']:
        if 'content' in item:
            content = item.content[0]['value']
        elif 'summary' in item:
            content = item.summary
        else:
            # Skip entries without a body instead of reusing the previous one.
            continue
        content_text = pq(content).text()
        i_count = sum(occurrences(content_text, x) for x in FIRST_PERSON_PRONOUNS)
        word_count = len(content_text.split())
        # Guard against empty posts to avoid dividing by zero.
        avg = i_count / word_count if word_count else 0.0
        print("{title}\nTotal 'I' count: {total}\nTotal word count: {word_count}\nAvg: {avg}\n\n".format(
            title=item.title,
            total=i_count,
            word_count=word_count,
            avg=avg,
        ))


if __name__ == '__main__':
    main()
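
The space-padded patterns in the script only match a pronoun with a space on both sides, so "I" at the start of a post, "me," before a comma, or contractions such as "I've" are not counted. One possible variation, shown here only as a sketch and not as the author's method, is to match on word boundaries with a regular expression. The names `PRONOUN_RE`, `count_pronouns`, and `pronoun_ratio` are hypothetical, and this variant would give different numbers from the ones the script prints.

```python
import re

# Assumed pattern (not part of the original script): whole-word matches for
# first-person singular pronouns, with optional contractions after "I".
# Both the straight (') and the typographic (’) apostrophe are accepted.
PRONOUN_RE = re.compile(
    r"\b(?:I(?:['’](?:d|m|ve|ll))?|[Mm]e|[Mm]y|[Mm]ine|[Mm]yself)\b"
)


def count_pronouns(text):
    """Return the number of first-person pronouns found in text."""
    return len(PRONOUN_RE.findall(text))


def pronoun_ratio(text):
    """Return pronouns divided by whitespace-separated words (0.0 if empty)."""
    words = text.split()
    if not words:
        return 0.0
    return count_pronouns(text) / len(words)
```

Called on plain text (for example the result of `pq(content).text()`), `pronoun_ratio` returns the same kind of fraction as the script's `Avg` line, but it also counts pronouns that touch punctuation.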
@jordanmessina

Some examples:

Tim Ferriss
$ ./count_the_is.py http://feeds.feedburner.com/TimFerriss
The 4-Hour Chef Launch — Marketing/PR Summary of Week One
Total 'I' count: 13
Total word count: 1497
Avg: 0.00868403473614

Food Photography Made Easy — Simple Tricks and Pro Tips from The 4-Hour Chef
Total 'I' count: 29
Total word count: 1730
Avg: 0.0167630057803

What to Do If Boycotted by 1,000+ Bookstores? Open Your Own Bookstores, Of Course.
Total 'I' count: 9
Total word count: 255
Avg: 0.0352941176471

Meet The New York City Food Marathon: 26.2 Dishes in 26 Locations in 24 Hours
Total 'I' count: 0
Total word count: 105
Avg: 0.0

The 4-Hour Chef is LIVE — Dr. Oz, NYC Cabs, TaskRabbit, London, and More
Total 'I' count: 5
Total word count: 352
Avg: 0.0142045454545


37signals
$ ./count_the_is.py http://feeds.feedburner.com/37signals/beMH
Pattern vision
Total 'I' count: 0
Total word count: 410
Avg: 0.0

Deconstructing the Cityscape
Total 'I' count: 25
Total word count: 402
Avg: 0.0621890547264

VIDEO: Inspiring talk by Adam Savage of Mythbusters…
Total 'I' count: 0
Total word count: 12
Avg: 0.0

Cities with signals
Total 'I' count: 0
Total word count: 128
Avg: 0.0

Tablets are waiting for their Movable Type
Total 'I' count: 3
Total word count: 205
Avg: 0.0146341463415

Publishers shouldn't be app developers
Total 'I' count: 0
Total word count: 228
Avg: 0.0

VIDEO: A really fun and smart TEDx talk by Rodney…
Total 'I' count: 3
Total word count: 77
Avg: 0.038961038961

Seeing the world, on the clock
Total 'I' count: 40
Total word count: 537
Avg: 0.0744878957169

The British are coming!
Total 'I' count: 0
Total word count: 185
Avg: 0.0

Better remote collaboration will make protectionism harder
Total 'I' count: 2
Total word count: 413
Avg: 0.00484261501211


Fred Wilson
$ ./count_the_is.py http://feeds.feedburner.com/avc
Media Metrix Multi Platform
Total 'I' count: 7
Total word count: 453
Avg: 0.0154525386313

The # Discover Tab
Total 'I' count: 10
Total word count: 134
Avg: 0.0746268656716

MBA Mondays: The Revenue Model Hackpad, Take Two
Total 'I' count: 15
Total word count: 232
Avg: 0.0646551724138

MBA Mondays: The Revenue Model Hackpad
Total 'I' count: 26
Total word count: 426
Avg: 0.0610328638498

What Has Changed
Total 'I' count: 5
Total word count: 970
Avg: 0.00515463917526

The Flow and The Balance
Total 'I' count: 12
Total word count: 434
Avg: 0.0276497695853

How Boxee Saved Our Thanksgiving (And How The Jets Ruined It)
Total 'I' count: 17
Total word count: 317
Avg: 0.0536277602524

Giving Thanks
Total 'I' count: 14
Total word count: 175
Avg: 0.08

The Missing Ad Unit
Total 'I' count: 15
Total word count: 347
Avg: 0.0432276657061

CSEdWeek
Total 'I' count: 6
Total word count: 281
Avg: 0.0213523131673
