@wcaleb
Created October 29, 2012 21:29
My first python script, using pandoc to convert a webpage to markdown
#! /usr/bin/env python
# -*- coding: utf-8 -*-
# pandoc-webpage.py
# Requires: pyandoc http://pypi.python.org/pypi/pyandoc/
# (Change path to pandoc binary in core.py before installing package)
# TODO: Preserve span formatting from original webpage
import urllib2
import pandoc
from bs4 import BeautifulSoup
# Get URLs for the desired webpages from a text file (one URL per line).
with open("urls.txt", "r") as urlFile:
    urls = urlFile.read().splitlines()
# Loop through the URLs, converting each page to markdown.
for url in urls:
    response = urllib2.urlopen(url)
    webContent = response.read()
    # Prepare the downloaded webContent for parsing with Beautiful Soup
    soup = BeautifulSoup(webContent)
    # Get the title of the post (the whole <h2> element, tags included)
    rawTitle = soup.h2
    title = str(rawTitle)
    # Get the date from the post, encoded as UTF-8 like the other pieces
    rawDate = "Originally posted on " + soup.h3.string
    date = rawDate.encode("utf-8")
    # Get rid of the byline div in the post, if there is one
    byline = soup.find("div", class_="byline")
    if byline is not None:
        byline.decompose()
    # Identify the blogPost section, which should now lack the byline
    rawPost = soup.find("div", class_="blogPost")
    # Had a lot of problems until I converted rawPost into a string,
    # which gives UTF-8 encoded text
    post = str(rawPost)
    # Combine the title, the date, and the post body
    fulltext = title + date + post
    # Call on pandoc to convert fulltext to markdown
    doc = pandoc.Document()
    doc.html = fulltext
    webConverted = doc.markdown
    # Append to file, replacing each backslash-newline pair with a plain newline
    with open('calebpost.txt', 'a') as f:
        f.write(webConverted.replace("\\\n", "\n"))
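
The replace call in the last line is easy to misread: "\\\n" is a two-character string, a backslash followed by a newline, which is presumably how hard line breaks showed up in the converted output. A minimal sketch of that cleanup in isolation, assuming Python 3 and a made-up sample string (strip_hard_breaks is a hypothetical helper name, not part of pyandoc):

```python
def strip_hard_breaks(markdown_text):
    """Replace each backslash-newline pair with a plain newline."""
    # "\\\n" is two characters: an escaped backslash, then a newline.
    return markdown_text.replace("\\\n", "\n")


# Hypothetical sample of converted markdown, not real pandoc output.
sample = "First line\\\nSecond line\n"
cleaned = strip_hard_breaks(sample)
print(repr(cleaned))
```

Lines that do not end in a backslash pass through unchanged, so the helper is safe to run on output that has no hard breaks at all.
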
@parezcoydigo

here you go:

start = webContent.find('<title>')+7
end = webContent.find('</title>')
rawTitle = webContent[start:end]
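
One caveat with the slice above: find returns -1 when the substring is missing, so a page without a <title> tag would quietly produce a wrong slice instead of an error. A minimal sketch that guards against this, assuming Python 3 and a hypothetical helper name extract_title (the sample pages are made up):

```python
def extract_title(html):
    """Return the text between <title> and </title>, or None if absent."""
    open_tag = "<title>"
    start = html.find(open_tag)
    end = html.find("</title>")
    # find() returns -1 when a tag is missing, so check before slicing.
    if start == -1 or end == -1 or end < start:
        return None
    return html[start + len(open_tag):end]


# Made-up page content for illustration only.
page = "<html><head><title>Example Post</title></head><body></body></html>"
print(extract_title(page))
print(extract_title("<html><body>No title here</body></html>"))
```
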

Then, you probably want to clean that title up:

rawTitle = rawTitle.replace('Offprints  &raquo; Blog Archive   &raquo; ','')
if '&#8220;' in rawTitle:
    rawTitle = rawTitle.replace('&#8220;','')
    rawTitle = rawTitle.replace('&#8221;','')
title = rawTitle.replace(' ','_')

The replace method is a built-in string method that lets you, well, replace parts of a string. But strings are immutable, so you have to re-assign the new string to your variable each time you use it to "save" the change. This little bit gets rid of the unnecessary parts of your html title and the codes for typographic quote marks, then replaces the remaining spaces with underscores, so your new file will be named: blog_post_title.txt
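
To make the point about immutability concrete, here is a small sketch, assuming Python 3 and a made-up title string: calling replace without re-assigning the result leaves the original untouched.

```python
prefix = "Offprints &raquo; Blog Archive &raquo; "
raw_title = prefix + "&#8220;My Post&#8221;"

# replace returns a new string; without re-assignment the result is lost.
raw_title.replace(prefix, "")
print(raw_title.startswith(prefix))  # still True

# Re-assigning on each step is what "saves" the change.
title = raw_title.replace(prefix, "")
title = title.replace("&#8220;", "").replace("&#8221;", "")
title = title.replace(" ", "_")
print(title)
```
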

Then, you'd use this to name the new file:

f = open(title+'.txt', 'w').write(webConverted)
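
Titles can contain characters that are awkward or invalid in file names (slashes, colons, question marks). A hedged sketch of one way to sanitize the title before using it, assuming Python 3; title_to_filename is a hypothetical helper, and the set of allowed characters is an arbitrary choice rather than anything from the original script:

```python
import re


def title_to_filename(title, extension=".txt"):
    """Build a conservative file name from a post title."""
    # Collapse runs of whitespace into single underscores.
    name = re.sub(r"\s+", "_", title.strip())
    # Drop anything other than letters, digits, underscores, hyphens and dots.
    name = re.sub(r"[^A-Za-z0-9_.-]", "", name)
    # Fall back to a fixed name if nothing usable is left.
    return (name or "untitled") + extension


# Made-up title for illustration only.
print(title_to_filename("Wendell Phillips: Orator/Abolitionist?"))
```
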

@parezcoydigo

Also, in the interest of a little more brevity, you could make a few little changes to the code you have:

import urllib2, pandoc

url = 'http://mcdaniel.blogs.rice.edu/?p=158'
response = urllib2.urlopen(url).read()

# The first line creates a pandoc object. The second assigns the html response to that object
doc = pandoc.Document()
doc.html = response

f = open('wendell-phillips.txt','w').write(doc.markdown)

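This way, the file is also closed automatically once the write finishes (at least in CPython, where the unnamed file object is discarded right away). That gives you six lines of code instead of eleven!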
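
For a close that does not depend on how the interpreter cleans up objects, the usual idiom is a with block. A minimal sketch, assuming Python 3, with a placeholder string standing in for doc.markdown and a temporary directory so nothing is left behind:

```python
import os
import tempfile

# Placeholder standing in for doc.markdown; no network or pandoc needed here.
markdown_text = "# Wendell Phillips\n\nSample body text.\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "wendell-phillips.txt")
    with open(path, "w") as f:
        f.write(markdown_text)
    # The file is closed as soon as the inner with block ends.
    print(f.closed)
    with open(path) as f:
        print(f.read() == markdown_text)
```

It costs one extra line compared with the one-liner, in exchange for a guaranteed close even if the write raises an error.
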

@wcaleb

wcaleb commented Oct 30, 2012

Thanks for the tips!
