@ruslanosipov
Created June 2, 2014 03:40
Script to convert WordPress posts to plain text files
#!/usr/bin/env python
"""This script converts a WXR file to a number of plain text files.

WXR stands for "WordPress eXtended RSS", which is basically just a
regular XML file. This script extracts entries from the WXR file into
plain text files. Output format: article name prefixed by date for
posts, article name for pages.

Usage: wxr2txt.py filename [-o output_dir]
"""

import os
import re
import sys
from xml.etree import ElementTree

# XML namespaces used when looking up WXR elements.
NAMESPACES = {
    'content': 'http://purl.org/rss/1.0/modules/content/',
    'wp': 'http://wordpress.org/export/1.2/',
}
USAGE_STRING = "Usage: wxr2txt.py filename [-o output_dir]"


def main(argv):
    filename, output_dir = _parse_and_validate_output(argv)
    try:
        data = ElementTree.parse(filename).getroot()
    except ElementTree.ParseError:
        _error("Invalid input file format. Cannot parse the input.")
    page_counter, post_counter = 0, 0
    for post in data.find('channel').findall('item'):
        post_type = post.find('wp:post_type', namespaces=NAMESPACES).text
        if post_type not in ('post', 'page'):
            continue
        # Empty elements have text None, so fall back to an empty string.
        content = post.find('content:encoded', namespaces=NAMESPACES).text or ''
        date = post.find('wp:post_date', namespaces=NAMESPACES).text
        title = post.find('title').text or ''
        # "2014-06-02 03:40:00" -> "20140602"
        date = date.split(' ')[0].replace('-', '')
        # Lowercase the title, replace other characters with "_", collapse runs.
        title = re.sub(r'[_]+', '_', re.sub(r'[^a-z0-9+]', '_', title.lower()))
        if post_type == 'post':
            post_filename = date + '_' + title + '.txt'
            post_counter += 1
        else:
            post_filename = title + '.txt'
            page_counter += 1
        # Python 2: encoded bytes can be written to a text-mode file.
        with open(os.path.join(output_dir, post_filename), 'w') as post_file:
            post_file.write(content.encode('utf8'))
    print "Saved {} posts and {} pages in directory '{}'.".format(
        post_counter, page_counter, output_dir)


def _parse_and_validate_output(argv):
    if len(argv) not in (2, 4):
        _error("Wrong number of arguments.")
    filename = argv[1]
    if not os.path.isfile(filename):
        _error("Input file does not exist (or not enough permissions).")
    output_dir = argv[3] if len(argv) == 4 and argv[2] == '-o' else os.getcwd()
    if not os.path.isdir(output_dir):
        _error("Output directory does not exist (or not enough permissions).")
    return filename, output_dir


def _error(text):
    print text
    print USAGE_STRING
    sys.exit(1)


if __name__ == "__main__":
    main(sys.argv)
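
To make the output naming easier to follow, here is a small standalone sketch of the date and title handling from the main loop. The helper name `make_filename` is made up for illustration and is not part of the script; the two regular expressions and the date handling are copied from it.

```python
import re


def make_filename(title, post_date, post_type):
    """Build an output file name the same way the script's main loop does."""
    # "2014-06-02 03:40:00" -> "20140602"
    date = post_date.split(' ')[0].replace('-', '')
    # Replace anything outside a-z, 0-9 and "+" with "_", then collapse runs.
    slug = re.sub(r'[_]+', '_', re.sub(r'[^a-z0-9+]', '_', title.lower()))
    if post_type == 'post':
        return date + '_' + slug + '.txt'
    return slug + '.txt'


print(make_filename('Hello, World!', '2014-06-02 03:40:00', 'post'))
# 20140602_hello_world_.txt
print(make_filename('About Me', '2014-06-02 03:40:00', 'page'))
# about_me.txt
```

Note that titles differing only in punctuation or non-ASCII characters map to the same name, so a later entry can overwrite an earlier one.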
@aharonium

Thank you. I've been using your script to provide an alternative and accessible archive of our site's posts:

I've been struggling with capturing certain metadata: categories, tags, and co-authors, as well as wp:postmeta data stored in a wp:meta_key/wp:meta_value format. Here's my fork so far.
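
For anyone attempting the same thing, the following is a rough Python 3 sketch of one way to pull that metadata out of a single item, not the code from the fork. It assumes categories and tags appear as `<category domain="...">` children of each item and that custom fields sit in `wp:postmeta` blocks; `SAMPLE_ITEM` and `extract_metadata` are made-up names, and the sample XML is hand-written rather than taken from a real export. How co-author plugins export their data is not confirmed here; if they use a taxonomy, it would presumably appear as another `domain` value.

```python
from xml.etree import ElementTree

NAMESPACES = {'wp': 'http://wordpress.org/export/1.2/'}

# Hypothetical, hand-written item; real exports carry many more fields.
SAMPLE_ITEM = """
<item xmlns:wp="http://wordpress.org/export/1.2/">
  <title>Example</title>
  <category domain="category" nicename="news"><![CDATA[News]]></category>
  <category domain="post_tag" nicename="python"><![CDATA[Python]]></category>
  <wp:postmeta>
    <wp:meta_key>subtitle</wp:meta_key>
    <wp:meta_value><![CDATA[An example]]></wp:meta_value>
  </wp:postmeta>
</item>
""".strip()


def extract_metadata(item):
    """Return (terms grouped by domain, postmeta dict) for one <item>."""
    terms = {}
    for category in item.findall('category'):
        domain = category.get('domain', 'unknown')
        terms.setdefault(domain, []).append(category.text)
    meta = {}
    for postmeta in item.findall('wp:postmeta', namespaces=NAMESPACES):
        key = postmeta.find('wp:meta_key', namespaces=NAMESPACES).text
        value = postmeta.find('wp:meta_value', namespaces=NAMESPACES).text
        meta[key] = value
    return terms, meta


terms, meta = extract_metadata(ElementTree.fromstring(SAMPLE_ITEM))
print(terms)  # {'category': ['News'], 'post_tag': ['Python']}
print(meta)   # {'subtitle': 'An example'}
```

Grouping by `domain` keeps the function agnostic about which taxonomies a given site uses.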

@jaffermaniar

This script works beautifully. It's exactly what I needed considering that I decided to take a site offline forever but keep all the content locally. Thank you.

On Mac OS X:

  1. I moved both wxr2txt.py and example.xml to a desired folder.
  2. I opened Terminal and ran cd ~/path/to/desired/folder.
  3. I executed the command: python wxr2txt.py example.xml

And voilà: all my posts, drafts, and pages appeared in over a hundred separate .txt files.

One shortcoming of this script is that you will need to clean up the HTML tags and comments such as <!-- wp:paragraph --> and <p>, but a simple find-and-replace-all with a blank space in your text editor should do the trick.
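
If editing a hundred files by hand is too tedious, the cleanup can also be scripted. Below is a minimal Python 3 sketch using only the standard library's `html.parser`; the names `TextExtractor` and `strip_html` are made up for illustration. It keeps the text between tags and drops the tags and comments themselves, so it does not preserve any formatting or paragraph structure beyond the newlines already in the source.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping tags and comments."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; comments never reach this method.
        self.parts.append(data)


def strip_html(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return ''.join(parser.parts).strip()


sample = ('<!-- wp:paragraph -->\n'
          '<p>Hello &amp; <em>welcome</em>.</p>\n'
          '<!-- /wp:paragraph -->')
print(strip_html(sample))
# Hello & welcome.
```

The same function could be applied to each post's content before it is written to disk, or to the .txt files afterwards.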

@joeldcanfield

After three errors about calling print without parentheses, I'm stuck on line 48 with this error:

post_file.write(content.encode('utf8'))
TypeError: write() argument must be str, not bytes

I barely understand Python, so I'm struggling through, because as a WordPress command-line geek I would find it marvelous to get this working.

@gkv-ckultzow

> after 3 errors regarding calling print without parentheses, I'm stuck on line 48 with the error
>
> post_file.write(content.encode('utf8'))
> TypeError: write() argument must be str, not bytes
>
> I barely understand python, so I'm struggling through because as a WordPress command line geek, this would be marvelous to get working.

For this line:
with open(os.path.join(output_dir, post_filename), 'w') as post_file:
replace 'w' with 'wb'.

@joeldcanfield

Why? What does that mean and what does it do?

@gkv-ckultzow

> Why? What does that mean and what does it do?

I don't really know the details of why this is actually a problem, but I needed to get this script working and came across this post:

https://www.sharooq.com/solved-typeerror-write-argument-must-be-str-not-bytes-in-python
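
As a rough illustration of what is going on: in Python 3, `content.encode('utf8')` produces `bytes`, while a file opened in text mode (`'w'`) only accepts `str`. Opening the file in binary mode (`'wb'`) makes it accept bytes. The sketch below reproduces the error and shows two ways around it; the sample string and the temporary path are made up for the example.

```python
import os
import tempfile

content = 'Caf\u00e9 post'           # str: a sequence of Unicode characters
encoded = content.encode('utf8')     # bytes: the UTF-8 encoding of that text

path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# Text mode ('w') accepts only str, so passing bytes raises TypeError.
try:
    with open(path, 'w') as post_file:
        post_file.write(encoded)
except TypeError as error:
    print(error)  # write() argument must be str, not bytes

# Option 1: binary mode ('wb') accepts the already-encoded bytes.
with open(path, 'wb') as post_file:
    post_file.write(encoded)

# Option 2: text mode with an explicit encoding; Python encodes on write.
with open(path, 'w', encoding='utf8') as post_file:
    post_file.write(content)

with open(path, 'rb') as post_file:
    print(post_file.read() == encoded)  # True
```

Either option works; the second avoids the manual `.encode()` call and may read more naturally in Python 3 code.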

@joeldcanfield

Excellent. Thanks.
