Fixes improper close order in XML
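// Build regexes without worrying about
// - double-backslashing
// - adding whitespace for readability
// - adding in comments
const clean = (piece) => (piece
    // strip /* block comments */ together with the whitespace before them
    .replace(/((^|\n)(?:[^\/\\]|\/[^*\/]|\\.)*?)\s*\/\*(?:[^*]|\*[^\/])*(\*\/|)/g, '$1')
    // strip // line comments together with the whitespace before them
    .replace(/((^|\n)(?:[^\/\\]|\/[^\/]|\\.)*?)\s*\/\/[^\n]*/g, '$1')
    // strip newlines and the indentation that follows them
    .replace(/\n\s*/g, '')
);
// Template tag: cleans each raw literal piece and splices the interpolations in verbatim
const regex = ({raw}, ...interpolations) => (
    new RegExp(interpolations.reduce(
        (regex, insert, index) => (regex + insert + clean(raw[index + 1])),
        clean(raw[0])
    ))
);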
const xfcwCache = {};
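// String.raw keeps \s and \\ intact; inside a plain string literal '\s' would collapse to 's'
const xmlFixClosedWithin = (what = String.raw`[^\s<>"/\\=&]+`) => ([xfcwCache[what] || (xfcwCache[what] = regex`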
    (?<=<)(${what})(\s(?:[^>"/]|"[^"]*")*|)(>
        (?:
            [^<]
            |<(?!\1[\s<>"/\\=&])([^\s<>"/\\=&]+)(?:
                \s(?:
                    [^>"/]
                    |"[^"]*"
                )*
                |
            )(?:
                \/>
                |>[^<]*<\/\4>
            )
        )*
    <\/)(?!\1)([^\s<>"/\\=&]+)(?=>)
`), '$1$2$3$1></$5><$1$2']);
const xfowCache = {};
// String.raw keeps the backslashes of the default tag-name pattern intact
const xmlFixOpenedWithin = (what = String.raw`[^\s<>"/\\=&]+`) => ([xfowCache[what] || (xfowCache[what] = regex`
    (?<=<)(${what})(\s(?:[^>"/]|"[^"]*")*|)(>
        (?:
            [^<]
            |<(?!\1[\s<>"/\\=&])([^\s<>"/\\=&]+)(?:
                \s(?:
                    [^>"/]
                    |"[^"]*"
                )*
                |
            )(?:
                \/>
                |>[^<]*<\/\4>
            )
            |</(?!\1)(?:[\s<>"/\\=&]+)>
        )*
    <)([^\s<>"/\\=&]+)(\s(?:[^>"/]|"[^"]*")*|)(?=>
        (?:
            [^<]
            |<(?!(?:\1|\5)[\s<>"/\\=&])[^\s<>"/\\=&]+(?:
                \s(?:
                    [^>"/]
                    |"[^"]*"
                )*
                |
            )\/?>
            |</(?!\1|\5)(?:[\s<>"/\\=&]+)>
        )*
    <\/\1>)
`), '$1$2$3/$1><$5$6><$1$2']);
const fixXML = (xml, fixes = [xmlFixClosedWithin(), xmlFixOpenedWithin()]) => {
    // a string is shorthand for "apply both fixes, but only to these tag names"
    if ((typeof fixes == 'string') || fixes instanceof String)
        fixes = [xmlFixClosedWithin(fixes), xmlFixOpenedWithin(fixes)];
    // the patterns have no g flag, so each pass repairs at most one match per fix;
    // keep going until a pass changes nothing or the iteration budget runs out
    let iterations = 10;
    for (
        let change = '';
        change != xml && --iterations && (change = xml);
    ) {
        for (const [problem, fix] of fixes)
            xml = xml.replace(problem, fix);
    }
    if (!iterations)
        throw new Error('Didn\'t manage to rectify the xml within 10 changes');
    return xml;
};

Hashbrown777 commented May 18, 2021

Algorithm 1: find nodes that were closed within a parent but weren't opened there.
Close and reopen the parent around the child's close tag.

fixXML(`<text top="845" left="136" width="284" height="14" font="1">
    <b>Via website<a href="http://www.redistribution.nsw.gov.au">: </b>www.redistribution.nsw.gov.au </a>
</text>`, [xmlFixClosedWithin()])
/*<text top="845" left="136" width="284" height="14" font="1">
    <b>Via website<a href="http://www.redistribution.nsw.gov.au">: </a></b><a href="http://www.redistribution.nsw.gov.au">www.redistribution.nsw.gov.au </a>
</text>*/

fixXML('<b>bold<i>bitalic<u>bundertalic</b>undertalic</i>underline</u>', [xmlFixClosedWithin()])
//<b>bold<i>bitalic<u>bundertalic</u></i></b><i><u>undertalic</u></i><u>underline</u>
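
To make the rewrite concrete, here is a reduced sketch of Algorithm 1. It is not the gist's pattern: it assumes that only plain text sits between the parent's open tag and the stray close tag, and closedWithinLite, name and attrs are illustrative names. The capture groups are renumbered accordingly, so $4 here plays the role of $5 in the full pattern.

```javascript
// Reduced sketch of Algorithm 1 (illustrative only, not the gist's full pattern).
// Assumption: nothing but text sits between the parent's open tag and the stray close tag.
const name = String.raw`[^\s<>"/\\=&]+`;           // a tag name
const attrs = String.raw`\s(?:[^>"/]|"[^"]*")*|`;  // optional attributes
const closedWithinLite = new RegExp(
    String.raw`(?<=<)(${name})(${attrs})(>[^<]*</)(?!\1)(${name})(?=>)`
);

// $1 = parent name, $2 = parent attributes,
// $3 = everything up to and including the stray "</", $4 = stray tag name.
// The replacement closes the parent, emits the stray close tag, then reopens the parent.
const fixed = '<b>Via <a href="x">: </b>link</a>'
    .replace(closedWithinLite, '$1$2$3$1></$4><$1$2');

console.log(fixed);
// <b>Via <a href="x">: </a></b><a href="x">link</a>
```

Here the parent is the a element: it is closed just before the stray </b> and reopened right after it, with its attributes repeated.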

Algorithm 2: find nodes that were opened within a parent but weren't closed before the parent was.
Close and reopen the parent around the child's open tag.

fixXML(`<text top="845" left="136" width="284" height="14" font="1">
    <b>Via website<a href="http://www.redistribution.nsw.gov.au">: </b>www.redistribution.nsw.gov.au </a>
</text>`, [xmlFixOpenedWithin()])
/*<text top="845" left="136" width="284" height="14" font="1">
    <b>Via website</b><a href="http://www.redistribution.nsw.gov.au"><b>: </b>www.redistribution.nsw.gov.au </a>
</text>*/

fixXML('<b>bold<i>bitalic<u>bundertalic</b>undertalic</i>underline</u>', [xmlFixOpenedWithin()])
//<b>bold</b><i><b>bitalic</b><u><b>bundertalic</b>undertalic</i>underline</u>
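
A reduced sketch of Algorithm 2 may help in reading the replacement string. Again, this is not the gist's pattern: it assumes only plain text sits between the parent's open tag, the child's open tag and the parent's close tag, and openedWithinLite, name and attrs are illustrative names. The capture groups are renumbered, so $4 and $5 here correspond to $5 and $6 in the full pattern.

```javascript
// Reduced sketch of Algorithm 2 (illustrative only, not the gist's full pattern).
// Assumption: only text sits between the parent's open tag, the child's open tag
// and the parent's close tag.
const name = String.raw`[^\s<>"/\\=&]+`;           // a tag name
const attrs = String.raw`\s(?:[^>"/]|"[^"]*")*|`;  // optional attributes
const openedWithinLite = new RegExp(
    String.raw`(?<=<)(${name})(${attrs})(>[^<]*<)(${name})(${attrs})(?=>[^<]*</\1>)`
);

// $1/$2 = parent name and attributes, $3 = text up to and including the child's "<",
// $4/$5 = child name and attributes.
// The replacement closes the parent before the child opens, then reopens it inside the child.
const fixed = '<b>Via <a href="x">: </b>link</a>'
    .replace(openedWithinLite, '$1$2$3/$1><$4$5><$1$2');

console.log(fixed);
// <b>Via </b><a href="x"><b>: </b>link</a>
```

Here the parent is the b element: it is closed just before <a ...> opens and reopened immediately inside it, so the original </b> now closes the reopened copy.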

Use both algorithms, but only allow tags with certain tag names to be duplicated:

fixXML(`<text top="845" left="136" width="284" height="14" font="1">
    <b>Via website<a href="http://www.redistribution.nsw.gov.au">: </b>www.redistribution.nsw.gov.au </a>
</text>`, 'b')
/*<text top="845" left="136" width="284" height="14" font="1">
    <b>Via website</b><a href="http://www.redistribution.nsw.gov.au"><b>: </b>www.redistribution.nsw.gov.au </a>
</text>*/

fixXML('<b>bold<i>bitalic<u>bundertalic</b>undertalic</i>underline</u>', 'i|u')
//<b>bold<i>bitalic<u>bundertalic</u></i></b><i></i><u><i>undertalic</i>underline</u>
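
The tag-name string is spliced verbatim into the pattern, inside a capture group, by the regex template tag, which is why an alternation such as 'i|u' works. The sketch below shows that mechanism in isolation. The clean and regex helpers are copied from the gist; openTag is a hypothetical, much simpler pattern used only for illustration.

```javascript
// Copied from the gist: strips comments and layout whitespace from a template piece.
const clean = (piece) => (piece
    .replace(/((^|\n)(?:[^\/\\]|\/[^*\/]|\\.)*?)\s*\/\*(?:[^*]|\*[^\/])*(\*\/|)/g, '$1')
    .replace(/((^|\n)(?:[^\/\\]|\/[^\/]|\\.)*?)\s*\/\/[^\n]*/g, '$1')
    .replace(/\n\s*/g, '')
);
// Copied from the gist: template tag that joins cleaned pieces and raw interpolations.
const regex = ({raw}, ...interpolations) => (
    new RegExp(interpolations.reduce(
        (regex, insert, index) => (regex + insert + clean(raw[index + 1])),
        clean(raw[0])
    ))
);

// Hypothetical illustration: matches the name of an opening tag, limited to `what`.
const openTag = (what) => regex`
    (?<=<)(${what})   // the tag name, restricted by the caller
    (?=[\s>])         // followed by whitespace or the closing chevron
`;

console.log(openTag('i|u').source);
// (?<=<)(i|u)(?=[\s>])
console.log(openTag('i|u').test('<b>bold<i>it</i></b>')); // true
console.log(openTag('b').test('<bold>'));                 // false
```

The comments and indentation inside the template never reach the RegExp constructor, and no backslash had to be doubled.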

Assumes the input contains no CDATA sections, no comments, and no chevrons (< or >) within attribute values.
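
Under the same assumptions, a small stack-based check can tell whether a string still has tags closed out of order, for example to sanity-check a result. This is a minimal sketch; isWellNested is a hypothetical helper and not part of the gist.

```javascript
// Hypothetical helper (not part of the gist): reports whether every tag closes in order.
// Relies on the same assumptions: no CDATA, no comments, no chevrons in attribute values.
const isWellNested = (xml) => {
    // captures: 1 = "/" for a close tag, 2 = tag name, 3 = "/" for a self-closing tag
    const tag = /<(\/?)([^\s<>"\/\\=&]+)(?:\s(?:[^>"\/]|"[^"]*")*)?(\/?)>/g;
    const open = [];
    for (const [, closing, name, selfClosing] of xml.matchAll(tag)) {
        if (selfClosing)
            continue;                 // <br/> neither opens nor closes anything
        if (!closing)
            open.push(name);          // remember the open tag
        else if (open.pop() !== name)
            return false;             // closed something other than the innermost tag
    }
    return open.length === 0;         // anything left over was never closed
};

console.log(isWellNested('<b>bold<i>bitalic</b>italic</i>'));        // false
console.log(isWellNested('<b>bold<i>bitalic</i></b><i>italic</i>')); // true
```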
