My first Python script, using pandoc to convert a webpage to Markdown
# pandoc-webpage.py
# Requires: pyandoc http://pypi.python.org/pypi/pyandoc/
# (Change path to pandoc binary in core.py before installing package)
# TODO: Iterate over a list of webpages
# TODO: Clean up HTML by removing hard linebreaks
# TODO: Delete header and footer
import urllib2
import pandoc
# Open the desired webpage
url = 'http://mcdaniel.blogs.rice.edu/?p=158'
response = urllib2.urlopen(url)
webContent = response.read()
# Call on pandoc to convert webContent to markdown and write to file
doc = pandoc.Document()
doc.html = webContent
webConverted = doc.markdown
f = open('wendell-phillips.txt','w')
f.write(webConverted)
f.close()
Also, in the interest of a little more brevity, you could make a few little changes to the code you have:
import urllib2, pandoc
url = 'http://mcdaniel.blogs.rice.edu/?p=158'
response = urllib2.urlopen(url).read()
# The first line creates a pandoc object. The second assigns the html response to that object
doc = pandoc.Document()
doc.html = response
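open('wendell-phillips.txt','w').write(doc.markdown)
This way you also don't need to call close() yourself: CPython closes the file once the file object is no longer referenced, although that is an implementation detail rather than a guarantee. That gives you six lines of code instead of eleven!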
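If you would rather not rely on the interpreter to close the file for you, a `with` block makes the closing explicit. This is only a minimal sketch: `save_markdown` is a hypothetical helper name, the sample string stands in for `doc.markdown`, and the file is written to a temporary directory so the example runs without pandoc or a network connection.

```python
import os
import tempfile


def save_markdown(text, path):
    """Write text to path (hypothetical helper, not part of pyandoc)."""
    with open(path, "w") as f:
        f.write(text)
    # The with block has ended here, so f is already closed.
    return f.closed


# Placeholder content standing in for doc.markdown from the script above.
sample = "# Wendell Phillips\n\nSome converted text.\n"
target = os.path.join(tempfile.mkdtemp(), "wendell-phillips.txt")
was_closed = save_markdown(sample, target)
```

It costs one extra line compared with the one-liner, but the file is closed at a known point, even if `write` raises an error.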
Thanks for the tips!
here you go:
Then, you probably want to clean that title up:
The `replace` method is a built-in string method that lets you, well, replace parts of a string. But strings are immutable, so you have to re-assign the new string to your variable each time you use it to "save" the change. This little bit then gets rid of unnecessary parts of your HTML title and the codes for typographic quote marks, and replaces the remaining spaces with underscores, so your new file will be named: blog_post_title.txt

Then, you'd use this to name the new file:
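The cleanup described above can be sketched roughly like this. Treat it as an illustration under assumptions rather than the original snippet: `title_to_filename` is a hypothetical name, the title is assumed to arrive still wrapped in `<title>` tags, the quote marks are assumed to be numeric HTML entities, and the lowercasing is added only to match the `blog_post_title.txt` example.

```python
def title_to_filename(title):
    """Turn an HTML page title into a simple .txt file name (hypothetical helper)."""
    # Strings are immutable, so each replace() result is re-assigned to title.
    title = title.replace("<title>", "").replace("</title>", "")
    # Drop the HTML codes for typographic quote marks
    # (assumed here to be numeric entities).
    for code in ("&#8216;", "&#8217;", "&#8220;", "&#8221;"):
        title = title.replace(code, "")
    # Lowercasing is an assumption, made to match the example file name.
    title = title.strip().lower()
    # Replace the remaining spaces with underscores.
    title = title.replace(" ", "_")
    return title + ".txt"


# Made-up title used only for illustration.
example = "<title>&#8220;Blog Post&#8221; Title</title>"
filename = title_to_filename(example)
```

The resulting `filename` can then be passed to `open(filename, 'w')` in place of the hard-coded `'wendell-phillips.txt'`.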