@dpenfoldbrown
Created August 28, 2013 18:27
Regex URLs to determine political leaning via labelled sources
import re

# List of URLs to classify (pretend it's populated)
urls = []
# Patterns to match in URLs. Some entries include the .org or .com suffix to avoid
# matching common words or letters (e.g. npr, slate, today).
# Add whatever other domains you want to match to the alternation (|) in each pattern.
# Dots are escaped so they match a literal "." rather than any character.
left_pattern = r"(?P<domain>nytimes|washingtonpost|npr\.org|abcnews|nbcnews|huffingtonpost|slate\.com|today\.com)"
center_pattern = r"(?P<domain>cnn|bbc\.co\.uk|yahoo)"
right_pattern = r"(?P<domain>foxnews|washingtontimes|usnews|chicagotribune)"
# Keep counts in dictionary
affiliation_count = {"left": 0, "right": 0, "center": 0, "unknown": 0}
for url in urls:
    if re.search(left_pattern, url):
        # Add "left" annotation to URL object in database via PYMONGO
        affiliation_count["left"] += 1
    elif re.search(right_pattern, url):
        # Add "right" annotation to URL object in database via PYMONGO
        affiliation_count["right"] += 1
    elif re.search(center_pattern, url):
        # Add "center" annotation to URL object in database via PYMONGO
        affiliation_count["center"] += 1
    else:
        # Add "unknown" annotation to URL object in database via PYMONGO
        affiliation_count["unknown"] += 1
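
The loop above searches the whole URL, so a pattern such as today.com can also hit usatoday.com, and cnn can hit a query string. A possible variant, offered as a sketch rather than as part of the original gist, matches only against the hostname and only at the start of a hostname label. The names `PATTERNS`, `classify`, and `count_affiliations` are hypothetical, and the sketch assumes every URL carries a scheme (http:// or https://) so that `urlparse` can find the host.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Same domains as the gist's patterns, with literal dots escaped.
# Dict order sets the check order: left, then right, then center.
PATTERNS = {
    "left": r"nytimes|washingtonpost|npr\.org|abcnews|nbcnews|huffingtonpost|slate\.com|today\.com",
    "right": r"foxnews|washingtontimes|usnews|chicagotribune",
    "center": r"cnn|bbc\.co\.uk|yahoo",
}


def classify(url):
    """Return "left", "right", "center", or "unknown" for a single URL."""
    # hostname is None when the URL has no scheme; treat that as no match.
    host = urlparse(url).hostname or ""
    for label, alternatives in PATTERNS.items():
        # (?:^|\.) requires the match to begin at the start of a hostname label,
        # so "usatoday.com" does not count as "today.com".
        if re.search(r"(?:^|\.)(?:" + alternatives + r")", host):
            return label
    return "unknown"


def count_affiliations(urls):
    """Tally labels for an iterable of URLs, keeping all four keys present."""
    counts = Counter({"left": 0, "right": 0, "center": 0, "unknown": 0})
    counts.update(classify(u) for u in urls)
    return dict(counts)
```

Calling `count_affiliations(urls)` returns a dictionary with the same four keys as `affiliation_count`; the database annotation step would still sit wherever `classify` is called.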