@Glench
Last active January 8, 2021 08:32
A command-line script to find the tropes that two or more media items have in common. To use it, install the Python libraries 'pattern' and 'pyquery', then pass in media names or links to TV Tropes pages: > python tv_tropes_common_tropes.py 'My Little Pony Friendship is Magic' 'Hamlet'
#!/usr/bin/python
# a script to get all the common tropes for media from tv tropes
# usage:
# python tv_tropes_matcher.py name1 name2 [name3...nameN]
# please put names with spaces or special characters in quotes
# you can also pass in the urls if it won't automatch by name.
# pip install pattern
# pip install pyquery
import sys
from pprint import pprint
import re
import urllib
from pattern import web
from pyquery import PyQuery
names = sys.argv[1:]
spider_regex = re.compile(r'[A-Z](To|-)[A-Z]$')
queries = ['#wikitext > ul > li > a:first-child', '#wikitext div > ul > li > a:first-child', '#wikitext > ul > li > ul a:first-child']
trope_urls = {}
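def get_tropes_by_url(url):
    # Download the page and collect the link text of every trope listed on it.
    page = web.URL(url).download()
    pq_page = PyQuery(page)
    print 'Page title:', pq_page('title').text()
    tropes = set()
    for query in queries:
        # Try the next selector only if the earlier ones found almost nothing.
        if len(tropes) < 3:
            for a in pq_page(query):
                pq_a = PyQuery(a)
                if spider_regex.search(pq_a.attr('href')):
                    # Link to a split index page (e.g. "...AToD"): recurse and merge.
                    tropes = tropes.union(get_tropes_by_url(pq_a.attr('href')))
                else:
                    trope_urls[pq_a.text()] = pq_a.attr('href')
                    tropes.add(pq_a.text())
    return tropes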
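def get_tropes(name):
    # TODO: turn name into url somehow
    # Accept both http and https links; anything else is treated as a title.
    if name.startswith(('http://', 'https://')):
        url = name
    else:
        # Search Google for 'tv tropes <name>'; this relies on the search URL
        # redirecting to the top hit.
        url = 'http://www.google.com/search?ie=UTF-8&oe=UTF-8&sourceid=navclient&gfns=1&q={}'.format(urllib.quote('tv tropes ' + name))
    print url
    return get_tropes_by_url(url)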
def trope_intersection(tropes1, tropes2):
    return tropes1.intersection(tropes2)
if len(names) > 1:
    # Fold the intersection over one trope set per name.
    common_tropes = reduce(trope_intersection, (get_tropes(name) for name in names))
    if common_tropes:
        print 'Common matches are: {}'.format(len(common_tropes))
        for trope in common_tropes:
            print '\t', trope #, '\t', trope_urls[trope]
    else:
        print 'There are no common tropes!'
else:
    print 'Please enter 2 or more shows/movies/books/etc'
    sys.exit(1)
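The last step of the script folds a set intersection over one trope set per title, so the result holds only what every title shares. Below is a minimal Python 3 sketch of that step alone. It uses hypothetical hard-coded trope sets in place of scraped pages, and `common_tropes` is an illustrative helper name, not part of the script. (The script itself is Python 2, where `reduce` is a builtin.)

```python
from functools import reduce  # a builtin in Python 2, lives in functools in Python 3


def common_tropes(trope_sets):
    """Return the tropes present in every set; empty input yields an empty set."""
    trope_sets = list(trope_sets)
    if not trope_sets:
        return set()
    return reduce(lambda a, b: a & b, trope_sets)


# Hypothetical sample data standing in for scraped pages.
sample = {
    'Work A': {'Chekhovs Gun', 'Big Bad', 'Title Drop'},
    'Work B': {'Big Bad', 'Title Drop', 'Downer Ending'},
    'Work C': {'Title Drop', 'Big Bad', 'Framing Device'},
}

shared = common_tropes(sample.values())
print(sorted(shared))
```

Because the fold works on any number of sets, passing three or more titles needs no extra logic.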
@DonaldTsang

DonaldTsang commented Jan 7, 2021

There are some issues with a naive "direct page" scraper, since:

  • It neglects the character, TearJerker, Trivia, and YMMV pages
  • It assumes that all information will be on one page (see: Homestuck)

Also, there is an idea that this tool can be generalized to compare multiple pieces of media.
Reference to a triple-media analysis proposal: rhgarcia/tropescraper#11
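On the single-page point: the script does recurse into links whose `href` matches `spider_regex`, which appears aimed at alphabetically split index pages, but it does not visit separate namespaces such as YMMV or Trivia. A small sketch of what that pattern catches, using hypothetical example paths rather than real scraped links:

```python
import re

# Same pattern as the script: a capital, then "To" or "-", then a final capital.
spider_regex = re.compile(r'[A-Z](To|-)[A-Z]$')

# Hypothetical hrefs, for illustration only.
hrefs = [
    '/pmwiki/pmwiki.php/ExampleWork/TropesAToD',  # split index page
    '/pmwiki/pmwiki.php/ExampleWork/TropesE-H',   # hyphenated variant
    '/pmwiki/pmwiki.php/Main/ExampleTrope',       # ordinary trope page
    '/pmwiki/pmwiki.php/YMMV/ExampleWork',        # other namespace, not followed
]

# Only links ending in a letter range are treated as sub-pages to crawl.
followed = [h for h in hrefs if spider_regex.search(h)]
for h in followed:
    print(h)
```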

@Glench
Author

Glench commented Jan 7, 2021

ok! I wrote this as a one-off tool during the MIT mystery hunt, so I'm not surprised it's incomplete. Hope you still found it useful — that's why I put it up!

@DonaldTsang

DonaldTsang commented Jan 8, 2021

Also some questions:

  1. Why is this not written with BeautifulSoup? Which tutorial do you think is "da best" for writing it?
  2. Would you consider the option that the other pages on the same medium are "useful"?

I am still screaming about the CSS tags being absurd and annoying. OUCH
