Created July 17, 2013 12:57
Convert Confluence HTML export into asciidoc
@Grab('net.sourceforge.htmlcleaner:htmlcleaner:2.4')
import org.htmlcleaner.*

def src = new File('html').toPath()
def dst = new File('asciidoc').toPath()

def cleaner = new HtmlCleaner()
def props = cleaner.properties
props.translateSpecialEntities = false
def serializer = new SimpleHtmlSerializer(props)

src.toFile().eachFileRecurse { f ->
    def relative = src.relativize(f.toPath())
    def target = dst.resolve(relative)
    if (f.isDirectory()) {
        target.toFile().mkdir()
    } else if (f.name.endsWith('.html')) {
        def tmpHtml = File.createTempFile('clean', 'html')
        println "Converting $relative"
        def result = cleaner.clean(f)
        result.traverse({ tagNode, htmlNode ->
            tagNode?.attributes?.remove 'class'
            if ('td' == tagNode?.name || 'th'==tagNode?.name) {
                tagNode.name='td'
                String txt = tagNode.text
                tagNode.removeAllChildren()
                tagNode.insertChild(0, new ContentNode(txt))
            }
            true
        } as TagNodeVisitor)
        serializer.writeToFile(
            result, tmpHtml.absolutePath, "utf-8"
        )
        "pandoc -f html -t asciidoc -R -S --normalize -s $tmpHtml -o ${target}.adoc".execute().waitFor()
        tmpHtml.delete()
    }/* else {
        "cp html/$relative $target".execute()
    }*/
}
I don't recall this; it's super old, but I don't think it deals with attached files.
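On attached files: the script's only handling for them is the commented-out `cp` branch, so nothing but `.html` files reaches the output tree. A minimal sketch of that missing step, assuming attachments and images are simply the non-HTML files in the export directory, could look like this. `copyAttachments` is a hypothetical helper, not part of the original script, and it does not rewrite image links inside the converted pages.

```groovy
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.StandardCopyOption

// Hypothetical helper: copy every non-HTML file (images, attachments)
// from src to the same relative location under dst.
// Returns the number of files copied.
int copyAttachments(Path src, Path dst) {
    int copied = 0
    src.toFile().eachFileRecurse { f ->
        if (f.isFile() && !f.name.endsWith('.html')) {
            Path target = dst.resolve(src.relativize(f.toPath()))
            // Create intermediate directories so the copy cannot fail on a missing parent
            Files.createDirectories(target.parent)
            Files.copy(f.toPath(), target, StandardCopyOption.REPLACE_EXISTING)
            copied++
        }
    }
    copied
}
```

It could be called once before or after the conversion loop, for example `copyAttachments(src, dst)` with the same `src` and `dst` paths the script defines.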
Atlassian might have inched forward too: using the CLI tool, it looks possible to get a JSON page format that might map more neatly to AsciiDoc equivalents:
https://developer.atlassian.com/cloud/jira/platform/apis/document/structure/
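To illustrate what such a mapping might look like, here is a rough sketch that walks a document in the linked JSON structure and emits AsciiDoc for headings, paragraphs, and a few text marks. Everything here is an assumption for illustration: `adfToAsciidoc` and `inline` are hypothetical names, only a handful of node types are handled, and mapping heading level N to N+1 equals signs is just one possible choice.

```groovy
import groovy.json.JsonSlurper

// Render a list of inline nodes; only "text" nodes with
// code / strong / em marks are handled in this sketch.
String inline(List nodes) {
    (nodes ?: []).collect { n ->
        if (n.type != 'text') return ''
        def marks = (n.marks ?: []).collect { it.type }
        def t = n.text
        if ('code' in marks) t = "`${t}`"
        if ('strong' in marks) t = "*${t}*"
        if ('em' in marks) t = "_${t}_"
        t
    }.join('')
}

// Render top-level block nodes; anything unknown becomes an AsciiDoc comment
// so that unsupported content is visible rather than silently dropped.
String adfToAsciidoc(Map doc) {
    doc.content.collect { node ->
        switch (node.type) {
            case 'heading':
                // Assumption: heading level 1 becomes "==", since "=" is the document title
                return ('=' * (node.attrs.level + 1)) + ' ' + inline(node.content)
            case 'paragraph':
                return inline(node.content)
            default:
                return "// unsupported node type: ${node.type}"
        }
    }.join('\n\n')
}

def sample = new JsonSlurper().parseText('''
{"type": "doc", "version": 1, "content": [
  {"type": "heading", "attrs": {"level": 2},
   "content": [{"type": "text", "text": "Setup"}]},
  {"type": "paragraph",
   "content": [{"type": "text", "text": "Run "},
               {"type": "text", "text": "pandoc", "marks": [{"type": "code"}]}]}
]}
''')
println adfToAsciidoc(sample)
```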
This gist is still going strong and is referenced in the official Asciidoctor documentation:
https://docs.asciidoctor.org/asciidoctor/latest/migrate/confluence-xhtml/
I had to adapt the parameters for current pandoc and added some rudimentary error handling:
https://gist.github.com/bdabelow/67db92c7bd33687353fd8a07ede9ff5c
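For context on the parameter changes: pandoc 2.0 removed the `-R` (`--parse-raw`), `-S` (`--smart`) and `--normalize` options that the original script passes. A minimal sketch of an updated call with an exit-code check might look like the following. It is not taken from the linked gist; `pandocCommand` and `convert` are hypothetical helper names, and the remaining flag set is an assumption that may need tuning for your export.

```groovy
// Hypothetical helper: build the pandoc argument list without the
// -R, -S and --normalize options that pandoc 2.0 removed.
List<String> pandocCommand(File input, File output) {
    ['pandoc', '-f', 'html', '-t', 'asciidoc', '-s',
     input.absolutePath, '-o', output.absolutePath]
}

// Hypothetical helper: run pandoc and fail loudly instead of
// silently ignoring a non-zero exit code.
void convert(File input, File output) {
    def proc = new ProcessBuilder(pandocCommand(input, output))
            .redirectErrorStream(true)   // merge stderr into stdout
            .start()
    // Read the output before waiting so a full pipe buffer cannot block pandoc
    def log = proc.inputStream.text
    int exit = proc.waitFor()
    if (exit != 0) {
        throw new IllegalStateException("pandoc exited with $exit for $input:\n$log")
    }
}
```

Passing the arguments as a list rather than one interpolated string also avoids problems with spaces in file names, which the original `"...".execute()` call would split on.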
Is line 22, tagNode?.attributes?.remove 'class', a good idea? For me it breaks conversion of code blocks.
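One possible workaround, sketched under the assumption that the code-block information lives in the `class` attribute of `pre` and `code` elements, is to strip `class` everywhere except on those tags. `shouldStripClass` and `cleanAttributes` are hypothetical helpers, and the attribute handling is shown on a plain map so the logic can be tried without HtmlCleaner.

```groovy
// Hypothetical helper: decide whether the class attribute should be removed.
// Assumption: only <pre> and <code> carry class values worth keeping.
boolean shouldStripClass(String tagName) {
    !(tagName?.toLowerCase() in ['pre', 'code'])
}

// Hypothetical helper: return the attributes with 'class' removed
// unless the tag is one of the protected ones.
Map<String, String> cleanAttributes(String tagName, Map<String, String> attrs) {
    shouldStripClass(tagName) ? attrs.findAll { k, v -> k != 'class' } : attrs
}

println cleanAttributes('td', [class: 'confluenceTd', colspan: '2'])
println cleanAttributes('pre', [class: 'brush: java'])
```

In the script's visitor, the same guard would read `if (shouldStripClass(tagNode?.name)) tagNode?.attributes?.remove 'class'`.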
This is based on the article that references this script:
https://docs.asciidoctor.org/asciidoctor/latest/migrate/confluence-xhtml
Do you recall the logic, and whether the script would deal with referenced or attached files and images?
When extracting to HTML, images look like this:
I created a self-contained Docker image to run the conversion, using the Atlassian CLI base image and adding Pandoc and Groovy:
https://github.com/npiper/confluence.extract