Skip to content

Instantly share code, notes, and snippets.

@Btibert3
Last active February 17, 2020 01:36
Show Gist options
  • Save Btibert3/d4b03c3a666f7905c708 to your computer and use it in GitHub Desktop.
Save Btibert3/d4b03c3a666f7905c708 to your computer and use it in GitHub Desktop.
Use R to convert Wordpress Posts to Markdown for Jekyll blogs

About

This code sample is how I used R to query the data loaded into Mysql from the Wordpress backup, converted the html to markdown, and moved into a jekyll blog.

Backup Wordpress

Depending on your provider, you should be able to access backup files of your MySQL database which contains the database for your blog. The files will be SQL files.

Using MySQL Workbench, create a database, activate it, using the following:

CREATE DATABASE blog;
USE blog;

In the workbench window, create a new script, and simply COPY/PASTE the contents of the SQL file into the new script. Run the script, and voila, you know have your blog within MySQL to query.

NOTE: I opened the SQL file from the backup in Sublime Text, and from there, just pasted into the new SQL file I opened in the Workbench. I couldnt get the restore feature to work, but c'est la vie. It was faster to copy/paste given this is one time thing.

The R Script

It's very important to note that I couldn't automate this. I had a loop to re-build my blog dynamically, but some posts had UTF-8 conversion issues. To get around those issues, I used the iconv function.

For me, it was much easier to debug the conversion process post-by-post. I didn't have that many, so it was much faster and easier to iterate over each post as opposed to having a script that handled errors.

Summary

It was surprisingly easy to convert my blog from Wordpress into jekyll using my hosting provider's backup, MySQL, and R.

#################################################################################
## Get the posts we want from the database put into mysql
## take the backup, create database, copy/paste the sql and execute script
## in mysql workbench.
#################################################################################
## options
options(stringsAsFactors=F)
## load the libraries
library(RMySQL)
library(stringr)
## connect to the database
ch = dbConnect(RMySQL::MySQL(), user="username", password="password", dbname="dbname")
dbListTables(ch)
## get the posts
SQL = "SELECT * FROM blog.ait_posts WHERE post_status = 'publish';"
dat = dbGetQuery(ch, SQL)
colnames(dat)
## select the posts you want to keep
dat$post_title
###############################################################################
## The code I manually walk through for each post
###############################################################################
## the record we want to keep from the post tile above
K = 11
## save the record data
tmp = dat[K, ]
## save the data
title = tmp$post_title
date = tmp$post_date
post = tmp$post_content
## sometimes there will be a pandoc conversion error (expecting UTF-8 stream)
## if you get this, convert the post by uncommenting below
# post = iconv(post, "LATIN2", "UTF-8")
## filename is the directory/
## YYYY-MM-DD-POST-TITLE-LIKE-THIS.md
FNAME = paste0("posts/",
str_extract(date, "[0-9]{4}-[0-9]{2}-[0-9]{2}"),
"-",
str_replace_all(title, " ", "-"),
".md")
## write the post to an html file
fileConn = file("posts/temp.html")
writeLines(post, fileConn)
close(fileConn)
## convert to markdown file with pandoc
CMD = paste0("pandoc -f html -t markdown posts/temp.html -o ", FNAME)
system(CMD)
## read the file back so we can add the front matter
TITLE = paste0("# ", title, "\n")
MD_FILE = list.files("posts", pattern = ".md$", full.names = T)
fileConn = file(MD_FILE)
LINES = readLines(fileConn)
writeLines(c("---",
"layout: post",
"---",
"",
TITLE,
"",
LINES),
fileConn)
close(fileConn)
## copy the markdown file to the blog directory
B_DIR = "~/github/btibert3.github.io/_posts/"
MD_FILE_FULL = list.files("posts", pattern = ".md$", full.names = T)
MD_FILE = list.files("posts", pattern = ".md$")
file.copy(MD_FILE_FULL, to = paste0(B_DIR, MD_FILE), overwrite = T)
## cleanup
rm(CMD, title, date, post, FNAME, tmp, fileConn, B_DIR, LINES, MD_FILE, MD_FILE_FULL, TITLE)
## unlink the temp files
FILES = list.files("posts", full.names=T)
unlink(FILES)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment