Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Script to convert WordPress posts to plain text files
#!/usr/bin/env python
"""This script converts WXR file to a number of plain text files.
WXR stands for "WordPress eXtended RSS", which basically is just a
regular XML file. This script extracts entries from the WXR file into
plain text files. Output format: article name prefixed by date for
posts, article name for pages.
Usage: wxr2txt.py filename [-o output_dir]
"""
import os
import re
import sys
from xml.etree import ElementTree
NAMESPACES = {
'content': 'http://purl.org/rss/1.0/modules/content/',
'wp': 'http://wordpress.org/export/1.2/',
}
USAGE_STRING = "Usage: wxr2txt.py filename [-o output_dir]"
def main(argv):
filename, output_dir = _parse_and_validate_output(argv)
try:
data = ElementTree.parse(filename).getroot()
except ElementTree.ParseError:
_error("Invalid input file format. Can not parse the input.")
page_counter, post_counter = 0, 0
for post in data.find('channel').findall('item'):
post_type = post.find('wp:post_type', namespaces=NAMESPACES).text
if post_type not in ('post', 'page'):
continue
content = post.find('content:encoded', namespaces=NAMESPACES).text
date = post.find('wp:post_date', namespaces=NAMESPACES).text
title = post.find('title').text
date = date.split(' ')[0].replace('-', '')
title = re.sub(r'[_]+', '_', re.sub(r'[^a-z0-9+]', '_', title.lower()))
if post_type == 'post':
post_filename = date + '_' + title + '.txt'
post_counter += 1
else:
post_filename = title + '.txt'
page_counter += 1
with open(os.path.join(output_dir, post_filename), 'w') as post_file:
post_file.write(content.encode('utf8'))
post_counter += 1
print "Saved {} posts and {} pages in directory '{}'.".format(
post_counter, page_counter, output_dir)
def _parse_and_validate_output(argv):
if len(argv) not in (2, 4):
_error("Wrong number of arguments.")
filename = argv[1]
if not os.path.isfile(filename):
_error("Input file does not exist (or not enough permissions).")
output_dir = argv[3] if len(argv) == 4 and argv[2] == '-o' else os.getcwd()
if not os.path.isdir(output_dir):
_error("Output directory does not exist (or not enough permissions).")
return filename, output_dir
def _error(text):
print text
print USAGE_STRING
sys.exit(1)
if __name__ == "__main__":
main(sys.argv)
@werowe

This comment has been minimized.

Copy link

@werowe werowe commented Jun 16, 2017

Great tool. It must be kind of old as not working with Python 2.6, so to make it work with Python 3.6 you need to remove the parameter name and just write:

post_type = post.find('wp:post_type', NAMESPACES).text

and put () around each print statement

and I just deleted this line as it generates a syntax error due to the change in Python version as well:

print "Saved {} posts and {} pages in directory '{}'.".format(
post_counter, page_counter, output_dir)

@aharonium

This comment has been minimized.

Copy link

@aharonium aharonium commented Jul 19, 2019

Thank you. I've been using your script to provide an alternative and accessible archive of our site's posts:

I've been struggling with capturing certain metadata: categories, tags, co-authors, as well as wp:postmeta data stored in a wp:meta_key/we:meta_value format. Here's my fork, so far.

@jaffermaniar

This comment has been minimized.

Copy link

@jaffermaniar jaffermaniar commented Mar 26, 2020

This script works beautifully. It's exactly what I needed considering that I decided to take a site offline forever but keep all the content locally. Thank you.

On Mac osX:

  1. I moved both wxr2txt.py and example.xml to a desired folder.
  2. went to Terminal and cd ~/path/to/desired/folder and
  3. executed the command: python wxr2txt.py example.xml

and voila: all my post, drafts, pages appeared in over a hundred separate .txt files.

One shortcoming, of this script is you will need to clean up the HTML tags and comments such as <!-- wp:paragraph --> and <p> but a simple find and replace all with a <blank-space> in your text editor should do the trick.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.