@melix
Created July 17, 2013 12:57
Convert Confluence HTML export into asciidoc
@Grab('net.sourceforge.htmlcleaner:htmlcleaner:2.4')
import org.htmlcleaner.*

// Input directory (Confluence HTML export) and output directory (AsciiDoc)
def src = new File('html').toPath()
def dst = new File('asciidoc').toPath()
// Make sure the output root exists before pandoc writes into it
dst.toFile().mkdirs()

def cleaner = new HtmlCleaner()
def props = cleaner.properties
props.translateSpecialEntities = false
def serializer = new SimpleHtmlSerializer(props)

src.toFile().eachFileRecurse { f ->
    def relative = src.relativize(f.toPath())
    def target = dst.resolve(relative)
    if (f.isDirectory()) {
        // Mirror the source directory layout
        target.toFile().mkdirs()
    } else if (f.name.endsWith('.html')) {
        // The suffix needs its leading dot to produce a "*.html" temp file
        def tmpHtml = File.createTempFile('clean', '.html')
        println "Converting $relative"
        def result = cleaner.clean(f)
        result.traverse({ tagNode, htmlNode ->
            // Strip class attributes before handing the HTML to pandoc
            tagNode?.attributes?.remove 'class'
            if ('td' == tagNode?.name || 'th' == tagNode?.name) {
                // Treat header cells as plain cells and flatten the cell to text
                tagNode.name = 'td'
                String txt = tagNode.text
                tagNode.removeAllChildren()
                tagNode.insertChild(0, new ContentNode(txt))
            }
            true // keep traversing
        } as TagNodeVisitor)
        serializer.writeToFile(
            result, tmpHtml.absolutePath, "utf-8"
        )
        // -R, -S and --normalize were removed in pandoc 2.0; drop them on newer versions
        "pandoc -f html -t asciidoc -R -S --normalize -s $tmpHtml -o ${target}.adoc".execute().waitFor()
        tmpHtml.delete()
    }/* else {
        "cp html/$relative $target".execute()
    }*/
}
@npiper

npiper commented Jul 12, 2023

I came here from this article, which references this script:
https://docs.asciidoctor.org/asciidoctor/latest/migrate/confluence-xhtml

Do you recall the logic, and whether the script handles referenced or attached files and images?

When extracted to HTML, images look like this:

<ac:image ac:align="center" ac:layout="center" ac:original-height="680" ac:original-width="1540"><ri:attachment ri:filename="CI_CD%20Pipeline%20Azure%20Hybrid%20(2).png?version=1&amp;modificationDate=1573576456718&amp;cacheVersion=1&amp;api=v2&amp;width=1000" ri:version-at-save="1" /></ac:image>

I created a self-contained Docker image to run the conversion, based on the Atlassian CLI base image with Pandoc and Groovy added:
https://github.com/npiper/confluence.extract

@melix
Author

melix commented Jul 12, 2023

I don't recall; this is super old, but I don't think it deals with attached files.
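
For anyone who needs attachments handled, here is a minimal sketch of one possible pre-processing step. It is not part of the gist. It assumes the `<ac:image>` markup looks like the sample above, that every `<ac:image>` wraps an `<ri:attachment>` with an `ri:filename` attribute, and that the attachment files are copied next to the generated `.adoc` files. The helper names `attachmentName` and `imagesToAsciidoc` are hypothetical.

```groovy
// Hypothetical helper: turn an ri:filename value into a plain file name.
String attachmentName(String riFilename) {
    // In the sample, '&' arrives entity-encoded as '&amp;'
    String raw = riFilename.replace('&amp;', '&')
    // Drop the query string (version, modificationDate, ...)
    int q = raw.indexOf('?')
    String path = q >= 0 ? raw.substring(0, q) : raw
    // Undo percent-encoding such as %20.
    // Caveat: URLDecoder also turns a literal '+' into a space.
    return URLDecoder.decode(path, 'UTF-8')
}

// Hypothetical helper: replace each <ac:image> element with an AsciiDoc
// block image macro pointing at the decoded file name.
// Assumes every <ac:image> contains an ri:filename attribute.
String imagesToAsciidoc(String storageXml) {
    return storageXml.replaceAll(/(?s)<ac:image\b[^>]*>.*?ri:filename="([^"]*)".*?<\/ac:image>/) { all, fn ->
        'image::' + attachmentName(fn) + '[]'
    }
}

// Usage with a shortened version of the sample markup
println imagesToAsciidoc('<ac:image ac:align="center"><ri:attachment ri:filename="CI_CD%20Pipeline.png?version=1&amp;api=v2" /></ac:image>')
```

The regular expression is deliberately simple. Markup that references images by URL rather than by attachment would need a separate branch.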

@npiper

npiper commented Jul 18, 2023

Atlassian might have inched forward too: with the CLI tool it looks possible to get a JSON page format that might map more neatly to AsciiDoc equivalents:

https://developer.atlassian.com/cloud/jira/platform/apis/document/structure/

This gist is still going strong and is referenced in the official Asciidoctor docs:
https://docs.asciidoctor.org/asciidoctor/latest/migrate/confluence-xhtml/

@bdabelow

I had to adapt the parameters for current pandoc and added some rudimentary error handling.

https://gist.github.com/bdabelow/67db92c7bd33687353fd8a07ede9ff5c
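
The linked gist is not reproduced here. As a rough illustration of what such an adaptation might look like, the sketch below assumes pandoc 2.0 or later, where `-R`, `-S` and `--normalize` no longer exist, and simply leaves them out. The helper names `pandocCommand` and `runOrFail` are hypothetical and may differ from what the linked gist does.

```groovy
// Hypothetical helper: build the pandoc argument list without the flags
// that pandoc 2.0 removed (-R, -S, --normalize).
List<String> pandocCommand(File input, File output) {
    return ['pandoc', '-f', 'html', '-t', 'asciidoc', '-s', input.path, '-o', output.path]
}

// Hypothetical helper: run a command, collect its output, and fail loudly
// when the exit code is non-zero instead of ignoring it.
void runOrFail(List<String> command) {
    def out = new StringBuilder()
    def err = new StringBuilder()
    def proc = command.execute()
    proc.waitForProcessOutput(out, err)
    if (proc.exitValue() != 0) {
        throw new IllegalStateException(
            "${command.join(' ')} exited with ${proc.exitValue()}: ${err}")
    }
}

// Usage (requires pandoc on the PATH):
// runOrFail(pandocCommand(new File('page.html'), new File('page.adoc')))
```

Passing the command as a list rather than a single string also avoids problems with spaces in file names, since Groovy splits a string command on whitespace.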

@thugcee

thugcee commented Mar 26, 2024

Is line 22, `tagNode?.attributes?.remove 'class'`, a good idea? For me it breaks the conversion of code blocks.
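
One possible workaround, sketched under the assumption that pandoc uses the `class` attribute on `pre` and `code` elements as a language hint for code blocks, is to strip `class` everywhere except on those two elements. This is untested against a real Confluence export, and the helper name `stripClassesExceptCode` is hypothetical.

```groovy
@Grab('net.sourceforge.htmlcleaner:htmlcleaner:2.4')
import org.htmlcleaner.*

// Hypothetical helper: remove class attributes from every element except
// <pre> and <code>, and return the cleaned tree.
TagNode stripClassesExceptCode(String html) {
    // Assumption: these are the only elements whose class matters to pandoc
    def keep = ['pre', 'code'] as Set
    def root = new HtmlCleaner().clean(html)
    root.traverse({ parent, node ->
        // The visitor also sees text and comment nodes; only tags have attributes
        if (node instanceof TagNode && !(node.name in keep)) {
            node.removeAttribute('class')
        }
        true // keep traversing
    } as TagNodeVisitor)
    return root
}

def cleaned = stripClassesExceptCode(
    '<div class="panel"><pre class="java">int x;</pre><p class="note">hi</p></div>')
println cleaned.findElementByName('pre', true).getAttributeByName('class')
```

In the script above, the same condition could replace the unconditional `remove 'class'` call inside the visitor.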
