@phansson
Created February 26, 2017 20:21
SafeHtml utility class for making strings safe to use in HTML

SafeHtml

This small Java utility class escapes strings so that they are safe to use in HTML. A single static method, htmlEscape(), does the job; its one-argument overload avoids double escaping by default.

All the existing solutions (libraries) I've reviewed suffer from one or more of the following issues:

  • They escape too much, which makes the HTML much harder to read and takes more time.
  • They don't tell you in the Javadoc exactly what they replace.
  • They do not document when the returned value is safe to use (safe to use as HTML element content? in an HTML attribute? etc.)
  • They are not optimized for speed.
  • They do not have a feature for avoiding double escaping (do not escape what is already escaped)
  • They replace single quote with &apos; (wrong!)

So I rolled my own. Guilty. Needless to say this utility class suffers from none of the above problems.

What is being escaped? (or rather: replaced)

The following characters will be escaped:

  • & (ampersand) -- replaced with &amp;
  • < (less than) -- replaced with &lt;
  • > (greater than) -- replaced with &gt;
  • " (double quote) -- replaced with &quot;
  • ' (single quote) -- replaced with &#39;
  • / (forward slash) -- replaced with &#47;

Justification: It is not necessary to escape more than this as long as the HTML page uses a Unicode encoding (indeed, most web pages use UTF-8, which is also the HTML5 recommendation). Escaping more than this makes the HTML much less readable.
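
To make the replacement table concrete, here is a minimal, self-contained sketch of the six replacements. The class name EscapeTableSketch is hypothetical and not part of SafeHtml, and the sketch deliberately leaves out the double-escape handling described in the next section.

```java
// Hypothetical demo class; not part of SafeHtml itself.
final class EscapeTableSketch {

    /** Applies the six replacements listed above; everything else is copied as-is. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                case '/':  sb.append("&#47;");  break;
                default:   sb.append(c);        // includes all non-ASCII characters
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Markup characters are replaced ...
        System.out.println(escape("<a href='x/y'>Tom & \"Jerry\"</a>"));
        // ... while non-ASCII characters such as \u00e9 pass through untouched.
        System.out.println(escape("caf\u00e9 < th\u00e9"));
    }
}
```

The first call yields &lt;a href=&#39;x&#47;y&#39;&gt;Tom &amp; &quot;Jerry&quot;&lt;&#47;a&gt;. In the second, only the < is replaced, which is the point of the justification above: on a Unicode-encoded page the accented characters need no escaping.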

Double escaping

In most cases you would want to avoid escaping what is already escaped.

If you don't, then the string Mont Blanc &lt; Mount Everest will become Mont Blanc &amp;lt; Mount Everest and this is rarely what you want.

If you set avoidDoubleEscape=true then any of the following inside the string will be left untouched:

  • &amp; or &lt; or &gt; or &quot;
  • decimal Unicode values of the form &#dddd;
  • hexadecimal Unicode values of the form &#xhhhh;
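
The rule above only affects how a & is treated. As a rough illustration, here is a minimal sketch that expresses the same rule as a regular expression rather than as a hand-written scan. DoubleEscapeSketch, BARE_AMP and escapeAmpersands are hypothetical names, and the sketch handles only the ampersand, not the other five characters.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; illustrates the rule, not the actual implementation.
final class DoubleEscapeSketch {

    // Matches a '&' that does NOT start one of the references left untouched:
    // &amp; &lt; &gt; &quot; &#dddd; or &#xhhhh;
    private static final Pattern BARE_AMP = Pattern.compile(
            "&(?!(?:amp|lt|gt|quot|#[0-9]+|#[xX][0-9a-fA-F]+);)");

    static String escapeAmpersands(String s, boolean avoidDoubleEscape) {
        if (avoidDoubleEscape) {
            // Only ampersands that are not already part of a reference are escaped.
            return BARE_AMP.matcher(s).replaceAll("&amp;");
        }
        // Every ampersand is escaped, including those in existing references.
        return s.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        String s = "Mont Blanc &lt; Mount Everest";
        System.out.println(escapeAmpersands(s, false)); // Mont Blanc &amp;lt; Mount Everest
        System.out.println(escapeAmpersands(s, true));  // Mont Blanc &lt; Mount Everest
    }
}
```

Note that a named reference outside the list, such as &copy;, is not recognized, so its ampersand is still escaped.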

This is free and unencumbered software released into the public domain.

Anyone is free to copy, modify, publish, use, compile, sell, or
distribute this software, either in source code form or as a compiled
binary, for any purpose, commercial or non-commercial, and by any
means.

In jurisdictions that recognize copyright laws, the author or authors
of this software dedicate any and all copyright interest in the
software to the public domain. We make this dedication for the benefit
of the public at large and to the detriment of our heirs and
successors. We intend this dedication to be an overt act of
relinquishment in perpetuity of all present and future rights to this
software under copyright law.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR
OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
OTHER DEALINGS IN THE SOFTWARE.

For more information, please refer to <http://unlicense.org>

/*
 * This is free and unencumbered software released into the public domain.
 *
 * See accompanying LICENSE file for more information or see
 * <http://unlicense.org/>
 */

/**
 * HTML string utility.
 */
public class SafeHtml {

    private SafeHtml() {
    }

    /**
     * Escapes a string for use in HTML element content or in an HTML
     * attribute value.
     *
     * <p>
     * The returned value is always suitable for HTML <i>element content</i>
     * but only suitable for an HTML <i>attribute</i> if the attribute value
     * is inside double quotes. So for attribute values be sure to do this:
     * <pre>
     * &lt;div title="value-from-this-method" &gt; ....
     * </pre>
     * Putting attribute values in double quotes is always a good idea
     * anyway.
     *
     * <p>
     * The following characters will be escaped:
     * <ul>
     * <li>{@code &} (ampersand) -- replaced with {@code &amp;}</li>
     * <li>{@code <} (less than) -- replaced with {@code &lt;}</li>
     * <li>{@code >} (greater than) -- replaced with {@code &gt;}</li>
     * <li>{@code "} (double quote) -- replaced with {@code &quot;}</li>
     * <li>{@code '} (single quote) -- replaced with {@code &#39;}</li>
     * <li>{@code /} (forward slash) -- replaced with {@code &#47;}</li>
     * </ul>
     * Justification: It is not necessary to escape more than this as long as
     * the HTML page
     * <a href="https://en.wikipedia.org/wiki/Character_encodings_in_HTML">uses
     * a Unicode encoding</a>. (Most web pages use UTF-8, which is also the
     * HTML5 recommendation.) Escaping more than this makes the HTML much less
     * readable.
     *
     * @param str the string to make HTML safe
     * @param avoidDoubleEscape avoid double escaping, which means for example
     * not escaping {@code &lt;} one more time. Any sequence {@code &....;}, as
     * explained in
     * {@link #isHtmlCharEntityRef(java.lang.String, int) isHtmlCharEntityRef()},
     * will not be escaped.
     *
     * @return an HTML-safe string
     */
    public static String htmlEscape(String str, boolean avoidDoubleEscape) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        // Implementation: Most likely this can be further optimized
        // by finding a way to lazily instantiate the StringBuilder, because
        // most often there will be strings where there's nothing to
        // escape at all and in that case it will be much faster not to
        // do an unnecessary copy of the string.
        StringBuilder sb = new StringBuilder(str.length() + 16);
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            switch (c) {
                case '&':
                    // Avoid double escaping if already escaped
                    if (avoidDoubleEscape && isHtmlCharEntityRef(str, i)) {
                        sb.append(c);
                    } else {
                        sb.append("&amp;");
                    }
                    break;
                case '<':
                    sb.append("&lt;");
                    break;
                case '>':
                    sb.append("&gt;");
                    break;
                case '"':
                    sb.append("&quot;");
                    break;
                case '\'':
                    sb.append("&#39;");
                    break;
                case '/':
                    sb.append("&#47;");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Escapes a string for use in HTML element content or in an HTML
     * attribute value. Double escaping is avoided, meaning this method is
     * equivalent to calling {@code htmlEscape(str, true)}.
     *
     * @see #htmlEscape(java.lang.String, boolean)
     *
     * @param str the string to make HTML safe
     * @return an HTML-safe string
     */
    public static String htmlEscape(String str) {
        return htmlEscape(str, true);
    }

    /**
     * Checks if the value at {@code index} in {@code str} is an HTML
     * character entity reference. This means any of:
     * <ul>
     * <li>{@code &amp;} or {@code &lt;} or {@code &gt;} or {@code &quot;}</li>
     * <li>A value of the form {@code &#dddd;} where {@code dddd} is a decimal
     * value</li>
     * <li>A value of the form {@code &#xhhhh;} where {@code hhhh} is a
     * hexadecimal value</li>
     * </ul>
     *
     * @param str the string to test for an HTML entity reference
     * @param index position of the {@code '&'} in {@code str}
     * @return {@code true} if there is an HTML entity reference at the
     * index position, otherwise {@code false}
     */
    public static boolean isHtmlCharEntityRef(String str, int index) {
        if (str.charAt(index) != '&') {
            return false;
        }
        int indexOfSemicolon = str.indexOf(';', index + 1);
        if (indexOfSemicolon == -1) { // is there a semicolon sometime later?
            return false;
        }
        if (indexOfSemicolon <= index + 2) { // too short to hold a reference
            return false;
        }
        if (followingCharsAre(str, index, "amp;")
                || followingCharsAre(str, index, "lt;")
                || followingCharsAre(str, index, "gt;")
                || followingCharsAre(str, index, "quot;")) {
            return true;
        }
        if (str.charAt(index + 1) == '#') {
            if (str.charAt(index + 2) == 'x' || str.charAt(index + 2) == 'X') {
                // It's presumably a hex value
                if (str.charAt(index + 3) == ';') { // "&#x;" has no digits
                    return false;
                }
                for (int i = index + 3; i < indexOfSemicolon; i++) {
                    char c = str.charAt(i);
                    boolean isHexDigit = (c >= '0' && c <= '9')
                            || (c >= 'A' && c <= 'F')
                            || (c >= 'a' && c <= 'f');
                    if (!isHexDigit) {
                        return false;
                    }
                }
                return true; // yes, the value is a hex string
            } else {
                // It's presumably a decimal value
                for (int i = index + 2; i < indexOfSemicolon; i++) {
                    char c = str.charAt(i);
                    if (c < '0' || c > '9') {
                        return false;
                    }
                }
                return true; // yes, the value is decimal
            }
        }
        return false;
    }

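    /**
     * Tests if the chars following position {@code startIndex} in string
     * {@code str} are those of {@code nextChars}.
     *
     * @param str the string to inspect
     * @param startIndex position immediately before the chars to compare
     * @param nextChars the chars expected to follow {@code startIndex}
     * @return {@code true} if {@code nextChars} occurs at
     * {@code startIndex + 1}, otherwise {@code false}
     */
    private static boolean followingCharsAre(String str, int startIndex, String nextChars) {
        // startsWith() compares in place instead of scanning the rest of the string
        return str.startsWith(nextChars, startIndex + 1);
    }
}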
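
The implementation note inside htmlEscape() suggests instantiating the StringBuilder lazily. One possible way to do that is sketched below, assuming the double-escape check is left out for brevity. LazyEscapeSketch and replacementFor are hypothetical names and not part of the class above.

```java
// Hypothetical sketch of lazy StringBuilder allocation; not part of SafeHtml.
final class LazyEscapeSketch {

    static String htmlEscape(String str) {
        if (str == null || str.isEmpty()) {
            return str;
        }
        StringBuilder sb = null; // allocated only once something must be escaped
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            String replacement = replacementFor(c);
            if (replacement == null) {
                if (sb != null) {
                    sb.append(c);
                }
                continue;
            }
            if (sb == null) {
                // First character that needs escaping: copy the untouched prefix.
                sb = new StringBuilder(str.length() + 16);
                sb.append(str, 0, i);
            }
            sb.append(replacement);
        }
        // Nothing to escape: return the original instance without copying.
        return sb == null ? str : sb.toString();
    }

    /** Returns the replacement for {@code c}, or null if none is needed. */
    private static String replacementFor(char c) {
        switch (c) {
            case '&':  return "&amp;";
            case '<':  return "&lt;";
            case '>':  return "&gt;";
            case '"':  return "&quot;";
            case '\'': return "&#39;";
            case '/':  return "&#47;";
            default:   return null;
        }
    }
}
```

With this shape, a string that contains nothing to escape is returned as the very same instance, so the common case costs one pass over the characters and no allocation.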