@joshtynjala
Last active April 12, 2024 20:46
Normalizes the HTML of old, Creative Commons-licensed documentation from help.adobe.com, making it easier to clean up and convert to other formats, such as Markdown, with tools like pandoc.
document.body.querySelector("#ahpod")?.remove();
document.body.querySelector("#mboxScriptContainer")?.remove();
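let h1 = document.querySelector("#content_wrapper h1:first-of-type");
let article = document.querySelector(
  "table#inner_content_table td:first-of-type"
);
// Fail early, before the body is cleared below, if the page lacks the expected layout.
if (!h1 || !article) {
  throw new Error("Expected heading or article cell not found");
}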
article.querySelector("div:first-of-type")?.remove();
article.querySelector("#chcPromo")?.remove();
article.querySelectorAll("script").forEach((element) => element.remove());
article.querySelector("#userprefs")?.remove();
// article.querySelector("#related")?.remove();
article.querySelector("#footer")?.remove();
article.querySelector("#minitoc")?.remove();
// Remove named anchors and "back to top" links.
article
  .querySelectorAll('a[name^="WS"]')
  .forEach((element) => element.remove());
article
  .querySelectorAll('a[href^="#top"]')
  .forEach((element) => element.remove());
// Drop invalid widths and link targets.
article
  .querySelectorAll('*[width^="NaN%"]')
  .forEach((element) => element.removeAttribute("width"));
article
  .querySelectorAll("a")
  .forEach((element) => element.removeAttribute("target"));
// Strip presentational attributes. This must run after the id-based lookups above,
// because it removes every id.
document.body.querySelectorAll("*").forEach((element) => {
  element.removeAttribute("class");
  element.removeAttribute("id");
  element.removeAttribute("style");
  element.removeAttribute("valign");
  element.removeAttribute("headers");
  element.removeAttribute("border");
  element.removeAttribute("cellpadding");
  element.removeAttribute("cellspacing");
  element.removeAttribute("align");
});
article
  .querySelectorAll("a")
  .forEach((element) => element.removeAttribute("onclick"));
document.body.innerHTML = "";
document.body.appendChild(h1);
document.body.appendChild(article);
document.body.innerHTML = document.body.innerHTML
  .replaceAll(/( |\n)<samp>\s+/g, "$1<samp>") // no whitespace after <samp>
  .replaceAll(/\s+<\/samp>/g, "</samp>") // no whitespace before </samp>
  .replaceAll(/(<\/\w+>)\s+(\.|,|:|\))/g, "$1$2") // no whitespace between end tag and punctuation
  .replaceAll(/\s+(<\/a>)/g, "$1"); // no whitespace before </a>
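
The final whitespace cleanup can be tried on a plain string, outside the browser. The sketch below copies the four replacements from the script into a helper; the name `tidyInlineWhitespace` and the sample string are made up for illustration and are not part of the original gist. It assumes a runtime that supports `String.prototype.replaceAll` (for example, Node.js 15 or later).

```javascript
// Hypothetical helper: the regular expressions are copied from the script,
// but the function name and the sample input are illustrative only.
function tidyInlineWhitespace(html) {
  return html
    .replaceAll(/( |\n)<samp>\s+/g, "$1<samp>") // no whitespace after <samp>
    .replaceAll(/\s+<\/samp>/g, "</samp>") // no whitespace before </samp>
    .replaceAll(/(<\/\w+>)\s+(\.|,|:|\))/g, "$1$2") // no whitespace between end tag and punctuation
    .replaceAll(/\s+(<\/a>)/g, "$1"); // no whitespace before </a>
}

const sample =
  'Call <samp> trace() </samp> , then see <a href="#x">the docs </a> .';
console.log(tidyInlineWhitespace(sample));
// prints: Call <samp>trace()</samp>, then see <a href="#x">the docs</a>.
```

Note that the first pattern only matches a `<samp>` preceded by a space or a newline, so an opening tag that directly follows other text, as in `(<samp> x</samp>)`, keeps its inner leading whitespace.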