@showa-yojyo
Created November 21, 2014 15:34
Scrape "Appendix: Glossary of graph theory" (Wiktionary).
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""Scrape "Appendix: Glossary of graph theory" (Wiktionary).

Example:
    $ scrape-graph-glossary.py | sort -d | uniq
"""
import re
import urllib.request

URL = 'http://en.wiktionary.org/wiki/Appendix:Glossary_of_graph_theory'
# Each glossary term sits in a <dt> element; capture its inner HTML.
DT_PATTERN = re.compile(r'<dt>(.+?)</dt>', re.I)
# Matches any remaining tag (e.g. <a>, <b>) so it can be stripped.
TAG_PATTERN = re.compile(r'<.*?>')


def main():
    # Download the page and decode it as UTF-8.
    with urllib.request.urlopen(URL) as fin:
        text = fin.read().decode('utf-8')

    # Print one term per line with non-breaking spaces and markup removed.
    for term in DT_PATTERN.findall(text):
        print(TAG_PATTERN.sub('', term.replace('&#160;', '')))


if __name__ == '__main__':
    main()
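
The extraction step can be tried without network access. The following is a minimal sketch, not part of the original script: the helper name `extract_terms` and the `SAMPLE` markup are made up for illustration, and unlike the script it decodes entities with `html.unescape` and turns a non-breaking space into an ordinary space instead of deleting it.

```python
import html
import re

DT_PATTERN = re.compile(r'<dt>(.+?)</dt>', re.I)
TAG_PATTERN = re.compile(r'<.*?>')


def extract_terms(page_text):
    """Return the plain-text content of every <dt> element (hypothetical helper)."""
    terms = []
    for fragment in DT_PATTERN.findall(page_text):
        # Drop nested markup such as <a> or <b>, then decode HTML entities.
        plain = html.unescape(TAG_PATTERN.sub('', fragment))
        # Assumption: a non-breaking space should become a normal space.
        terms.append(plain.replace('\xa0', ' ').strip())
    return terms


# Made-up sample markup in the shape of a definition list.
SAMPLE = (
    '<dl><dt><a href="/wiki/acyclic">acyclic</a></dt><dd>first definition</dd>'
    '<dt><b>bipartite&#160;graph</b></dt><dd>second definition</dd></dl>'
)

if __name__ == '__main__':
    for term in extract_terms(SAMPLE):
        print(term)
```

Separating the parsing from the download keeps the regular expressions testable against a fixed string, which helps when the live page layout changes.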