My first python script, using pandoc to convert a webpage to markdown
#! /usr/bin/env python
# -*- coding: utf-8 -*-
# Requires: pyandoc
# (Change path to pandoc binary in before installing package)
# TODO: Preserve span formatting from original webpage
import urllib2
import pandoc
from bs4 import BeautifulSoup
# Get URLs for the desired webpages from a text file.
urls = open("urls.txt", "r").read()
urls = urls.splitlines()
# Loop through the URLs, converting each page to markdown.
for url in urls:
    response = urllib2.urlopen(url)
    webContent = response.read()
    # Prepare the downloaded webContent for parsing with Beautiful Soup
    soup = BeautifulSoup(webContent)
    # Get the title of the post
    rawTitle = soup.h2
    title = str(rawTitle)
    # Get the date from the post
    rawDate = "Originally posted on " + soup.h3.string
    date = str(rawDate)
    # Get rid of the byline div in the post
    byline = soup.find("div", class_="byline")
    if byline is not None:
        byline.extract()
    # Identify the blogPost section, which should now lack the byline
    rawPost = soup.find("div", class_="blogPost")
    # Had a lot of problems until I converted rawPost into a string;
    # under Python 2, str() on a Beautiful Soup tag returns UTF-8 encoded bytes
    post = str(rawPost)
    # Combine the title, the date, and the post body
    fulltext = title + date + post
    # Call on pandoc to convert fulltext to markdown
    doc = pandoc.Document()
    doc.html = fulltext
    webConverted = doc.markdown
    # Append to file, replacing backslash-escaped line breaks with plain newlines
    with open('calebpost.txt', 'a') as f:
        f.write(webConverted.replace("\\\n", "\n"))

commented Oct 30, 2012

here you go:

start = webContent.find('<title>')+7
end = webContent.find('</title>')
rawTitle = webContent[start:end]

Then, you probably want to clean that title up:

rawTitle = rawTitle.replace('Offprints  &raquo; Blog Archive   &raquo; ','')
# str.find returns -1 (which is truthy) when nothing is found, so test with "in" instead
if '&#8220;' in rawTitle:
    rawTitle = rawTitle.replace('&#8220;','')
    rawTitle = rawTitle.replace('&#8221;','')
title = rawTitle.replace(' ','_')

The replace method is a built-in string method that lets you, well, replace parts of a string. But strings are immutable, so you have to reassign the new string to your variable each time you use it to "save" the change. This little bit gets rid of the unnecessary parts of your HTML title and the codes for the typographic quote marks, then replaces the remaining spaces with underscores, so your new file will be named something like: blog_post_title.txt
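If it helps to see the immutability point in isolation, here is a minimal sketch that wraps the same cleanup in a function. The names `clean_title` and `PREFIX` and the sample title are made up for illustration, and the extra `strip()` call is my own addition:

```python
# Hypothetical prefix, copied from the replace call above.
PREFIX = 'Offprints  &raquo; Blog Archive   &raquo; '

def clean_title(raw_title, prefix=PREFIX):
    """Return a filename-friendly version of an HTML <title> string."""
    # str.replace returns a NEW string; raw_title itself is never modified.
    cleaned = raw_title.replace(prefix, '')
    # Drop the HTML entities for the typographic quote marks.
    for entity in ('&#8220;', '&#8221;'):
        cleaned = cleaned.replace(entity, '')
    # Trim stray whitespace, then turn spaces into underscores.
    return cleaned.strip().replace(' ', '_')

# Made-up sample title for demonstration only.
original = PREFIX + '&#8220;Blog Post Title&#8221;'
title = clean_title(original)
```

After this runs, `title` holds the cleaned-up name while `original` is unchanged, which is why each `replace` result has to be assigned to a variable.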

Then, you'd use this to name the new file:

f = open(title+'.txt', 'w').write(webConverted)

commented Oct 30, 2012

Also, in the interest of a little more brevity, you could make a few little changes to the code you have:

import urllib2, pandoc

url = ''
response = urllib2.urlopen(url).read()

# The first line creates a pandoc object. The second assigns the html response to that object
doc = pandoc.Document()
doc.html = response

f = open('wendell-phillips.txt','w').write(doc.markdown)

This way, the file is also closed automatically (in CPython, as soon as the unreferenced file object is discarded). That gives you six lines of code instead of eleven!
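If you would rather not depend on the interpreter cleaning up for you, a `with` block closes the file explicitly. This is only a sketch: `save_markdown` is a hypothetical helper, and the temporary directory and sample text stand in for your real path and `doc.markdown`:

```python
import os
import tempfile

def save_markdown(path, text):
    """Write text to path and report whether the file ended up closed."""
    # The with block closes the file on exit, even if write() raises.
    with open(path, 'w') as f:
        f.write(text)
    return f.closed

# Stand-in path and content for demonstration only.
path = os.path.join(tempfile.mkdtemp(), 'wendell-phillips.txt')
closed = save_markdown(path, '# Title\n\nBody text.\n')
```

It costs one extra line, but the file is closed at a known point rather than whenever the file object happens to be collected.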


commented Oct 30, 2012

Thanks for the tips!

