@WebReflection
Last active August 21, 2022 16:27
How to escape and unescape from a language to another

update

I've created a little repository that simply exposes the final utility as an npm module.

It's called html-escaper


There is basically only one rule: never replace one char after another when you are transforming a string into another.

// WARNING: THIS IS WRONG
// if you are that kind of dev that does this
function escape(s) {
  return s.replace(/&/g, "&amp;")
          .replace(/</g, "&lt;")
          .replace(/>/g, "&gt;")
          .replace(/'/g, "&#39;")
          .replace(/"/g, "&quot;");
}

// you might be the same dev that does this too
function unescape(s) {
  return s.replace(/&amp;/g, "&")
          .replace(/&lt;/g, "<")
          .replace(/&gt;/g, ">")
          .replace(/&#39;/g, "'")
          .replace(/&quot;/g, '"');
}

// guess what we have here ?
unescape('&amp;lt;');

// now guess this XSS too ...
unescape('&amp;lt;script&amp;gt;alert("yo")&amp;lt;/script&amp;gt;');

The last example will produce <script>alert("yo")</script> instead of the expected &lt;script&gt;alert("yo")&lt;/script&gt;.

Nothing like this can happen if we grab all chars at once, in either direction. In the escape above it is just a fortunate case that, because & is swapped with &amp; first, no later replace is affected. That ordering is fragile and not portable, and relying on it is universally a bad practice.
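To see where the chained unescape goes wrong, it can help to record the string after every replacement. The following is only a sketch: `traceChainedUnescape` is a hypothetical helper written for this illustration, applying the same five replacements in the same order as the chained version above.

```javascript
// Hypothetical helper (not part of html-escaper): applies the same
// replacements as the chained unescape, keeping the string after each step.
function traceChainedUnescape(s) {
  var steps = [
    [/&amp;/g, '&'],
    [/&lt;/g, '<'],
    [/&gt;/g, '>'],
    [/&#39;/g, "'"],
    [/&quot;/g, '"']
  ];
  var trace = [s];
  steps.forEach(function (step) {
    s = s.replace(step[0], step[1]);
    trace.push(s);
  });
  return trace;
}

var trace = traceChainedUnescape('&amp;lt;');
// trace[0] is '&amp;lt;'  the input
// trace[1] is '&lt;'      the correct answer, decoding should stop here
// trace[2] is '<'         the second pass decodes text the first pass created
```

The second replacement cannot tell the `&lt;` it was given from the `&lt;` that the first replacement just produced, which is exactly why the input ends up decoded twice.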

Grab all chars at once, no excuses!

// with "any char" compatible HTML escaping
function escape(s) {
  return s.replace(/[&<>'"]/g, function (m) {
    return '&#' + m.charCodeAt(0) + ';';
  });
}

// with predefined object (preferred)
function escape(s) {
  var escaped = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    "'": '&#39;',
    '"': '&quot;'
  };
  return s.replace(/[&<>'"]/g, function (m) {
    return escaped[m];
  });
}

// with predefined object specific
// for HTML entities only
function unescape(s) {
  var re = /&(?:amp|#38|lt|#60|gt|#62|apos|#39|quot|#34);/g;
  var unescaped = {
    '&amp;': '&',
    '&#38;': '&',
    '&lt;': '<',
    '&#60;': '<',
    '&gt;': '>',
    '&#62;': '>',
    '&apos;': "'",
    '&#39;': "'",
    '&quot;': '"',
    '&#34;': '"'
  };
  return s.replace(re, function (m) {
    return unescaped[m];
  });
}

With the code above there is no risk that any char before or after another could interfere with the others. You escape and you unescape: it's a 1 to 1 operation, with no surprises in the middle.
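One way to convince yourself of the 1 to 1 claim is a small round-trip check. This is a minimal sketch: `escapeOnce`, `unescapeOnce`, and the sample strings are names and data chosen for this illustration, and the function bodies simply repeat the single-pass versions above.

```javascript
// Single-pass escape: every special char is matched and replaced in one scan.
function escapeOnce(s) {
  var escaped = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    "'": '&#39;',
    '"': '&quot;'
  };
  return s.replace(/[&<>'"]/g, function (m) {
    return escaped[m];
  });
}

// Single-pass unescape: text produced by a replacement is never scanned again.
function unescapeOnce(s) {
  var unescaped = {
    '&amp;': '&', '&#38;': '&',
    '&lt;': '<', '&#60;': '<',
    '&gt;': '>', '&#62;': '>',
    '&apos;': "'", '&#39;': "'",
    '&quot;': '"', '&#34;': '"'
  };
  return s.replace(/&(?:amp|#38|lt|#60|gt|#62|apos|#39|quot|#34);/g, function (m) {
    return unescaped[m];
  });
}

// Illustrative samples, including strings that already look escaped.
var samples = ['<', '&lt;', '&amp;lt;', '<script>alert("yo")</script>', "it's & it isn't"];
var allRoundTrip = samples.every(function (s) {
  return unescapeOnce(escapeOnce(s)) === s;
});

// The input that broke the chained version is decoded exactly one level.
var decodedOnce = unescapeOnce('&amp;lt;'); // '&lt;'
```

Every sample survives the round trip, including the ones that already contain entities.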

You'd like to have a little utility?

var html = require('html-escaper');

// you can test like this
var unescaped = '<&>"\'';
var escaped = html.escape(unescaped);
html.unescape(escaped) === unescaped;
@mathiasbynens

Oh man, I’ve been there!

@WebReflection
Author

more details
Somebody might think this is an unescape issue only. It's not: it's an anti-pattern, and its side effects work both ways.

As an example, changing the order of the replacements when escaping produces an unexpected result:

function escape(s) {
  return s.replace(/</g, "&lt;")
          .replace(/>/g, "&gt;")
          .replace(/'/g, "&#39;")
          .replace(/"/g, "&quot;")
          .replace(/&/g, "&amp;");
}

escape('<'); // &amp;lt; instead of &lt;

If we don't want to code in fear that the order isn't perfect, or that our order in escaping or unescaping differs from the order some other method or function used; if we understand the issue and agree it's a potentially disaster-prone approach; and if we add that, in this case, creating five RegExp objects each time and invoking .replace five times through String.prototype is also potentially slower than creating a single function that holds one object (or holds the callback too), then we should agree there is absolutely no valid reason to keep proposing a char-by-char implementation.

We already have proof that this approach can fail, so why take the risk? Just avoid it and grab all chars at once.
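The "one function holding one object, or holding the function too" idea could look something like the sketch below. The name `escapeHtml` and the closure layout are assumptions made for this illustration, not necessarily how html-escaper is written, and no speed-up has been measured here.

```javascript
// Sketch: the regex, the lookup object and the callback are created once,
// when the closure runs, instead of on every call.
var escapeHtml = (function () {
  var re = /[&<>'"]/g;
  var map = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    "'": '&#39;',
    '"': '&quot;'
  };
  function replacer(m) {
    return map[m];
  }
  return function (s) {
    // String.prototype.replace starts a global regex from index 0 on each
    // call, so sharing `re` between calls is safe here.
    return s.replace(re, replacer);
  };
}());

var first = escapeHtml('<b>"x" & \'y\'</b>');
var second = escapeHtml('<b>"x" & \'y\'</b>'); // same input, same shared regex
```

There is only one pass over the string, so there is no order of replacements to get wrong.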
