@cemeyer
Created September 15, 2013 16:02

Hack: Improving XML bad-character filtering performance by 10x (Qt4)

Backstory: Some Unicode code points are invalid in XML; specifically, we want to filter the code points (hex) 0-8, b-c, and e-1f. Qt4's QXmlStreamWriter makes no attempt to filter these code points in the documents it writes.

So, here's our baseline, with the same benchmark we'll use later (big.gpx is a giant 142 MB file, and GPX is an XML-based format):

./gpsbabel -i gpx -f big.gpx [-o gpx -F /dev/null] (x5)

17.2 user, 4.8 system, 22.1 total

In GPSBabel, we already subclass QXmlStreamWriter to add other convenience methods. So our initial attempt to filter bad characters was simply to interpose on all methods that take potentially unfiltered inputs (writeAttribute(), writeTextElement(), ...) and apply a QString::replace() substitution with a QRegExp.
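
Concretely, the interposed wrappers looked something like this (a sketch; the exact GPSBabel method signatures and class layout may differ). Note that QXmlStreamWriter's write methods aren't virtual, so this works by hiding the base-class methods and calling through the subclass type:

#include <QRegExp>
#include <QString>
#include <QXmlStreamWriter>

class XmlStreamWriter : public QXmlStreamWriter
{
public:
  XmlStreamWriter(QIODevice* device) : QXmlStreamWriter(device) {}

  // Hides (does not override) the base method; filter, then delegate.
  void writeTextElement(const QString& name, const QString& text)
  {
    QString filtered = text;
    // Replace the XML-invalid code points 0x0-0x8, 0xb, 0xc, 0xe-0x1f.
    filtered.replace(QRegExp("[\\x0000-\\x0008\\x000b\\x000c\\x000e-\\x001f]"), " ");
    QXmlStreamWriter::writeTextElement(name, filtered);
  }
};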

It turns out QRegExp is really slow. Slower than PCRE. In my benchmarks, QRegExp costs us an extra 55% user CPU time:

26.7 user, 4.7 system, 31.5 total

Qt5 adds QRegularExpression, built on PCRE, which is moderately faster. Unfortunately, I don't have Qt5 locally to benchmark. Nor do many people -- so we need something reasonable on Qt4. For example, RHEL/CentOS 6 is stuck on Qt 4.6.
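
For reference, a Qt5 version of the same substitution would look roughly like this (unbenchmarked, as noted above; the helper name is illustrative):

#include <QRegularExpression>
#include <QString>

// Same character class as above, in PCRE's \x{hhhh} syntax.
static QString filterBadChars(QString text)
{
  static const QRegularExpression badChars(
      "[\\x{0000}-\\x{0008}\\x{000b}\\x{000c}\\x{000e}-\\x{001f}]");
  text.replace(badChars, " ");
  return text;
}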

So, here's the hack: We create our own QTextCodec (responsible for translating to and from the internal UTF-16 QString encoding) which mostly defers to the built-in Qt UTF-8 text codec, but performs the additional step of stripping out invalid code points.

QByteArray XmlTextCodec::convertFromUnicode(const QChar* c, int n, QTextCodec::ConverterState* s) const
{
  // Encode with the stock UTF-8 codec, then scrub the result in place.
  QByteArray r = utf8Codec->fromUnicode(c, n, s);
  char* data = r.data();
  for (int i = 0; i < r.size(); i++) {
    // Replace XML-invalid control bytes with spaces. Bytes >= 0x80 (the
    // lead/continuation bytes of multi-byte UTF-8 sequences) fail all three
    // range checks whether char is signed or unsigned, so they pass through
    // untouched.
    if ((0x00 <= data[i] && data[i] <= 0x08) ||
        (0x0b <= data[i] && data[i] <= 0x0c) ||
        (0x0e <= data[i] && data[i] <= 0x1f)) {
      data[i] = ' ';
    }
  }
  return r;
}

This is valid for two reasons inherent to UTF-8: any code point in [0, 127] (US-ASCII) is encoded as the same single byte in UTF-8; and every byte of a multi-byte UTF-8 sequence has the high-order bit, 0x80, set. So, we can't clobber any valid UTF-8 sequence by accident.
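
For completeness, here is a sketch of how the rest of the codec subclass and its hookup might look. Everything beyond convertFromUnicode() above (the constructor, name, and MIB number) is assumed rather than taken from the actual GPSBabel source:

#include <QTextCodec>

class XmlTextCodec : public QTextCodec
{
public:
  // Constructing a QTextCodec registers it with Qt, which owns it from
  // then on; don't delete it yourself.
  XmlTextCodec() : utf8Codec(QTextCodec::codecForName("UTF-8")) {}

  QByteArray name() const { return "XML-UTF-8"; } // any otherwise-unused name
  int mibEnum() const { return -1; }              // no official MIB number

protected:
  // Decoding needs no filtering: defer straight to UTF-8.
  QString convertToUnicode(const char* c, int n, ConverterState* s) const
  {
    return utf8Codec->toUnicode(c, n, s);
  }

  // The filtering encoder shown above.
  QByteArray convertFromUnicode(const QChar* c, int n, ConverterState* s) const;

  QTextCodec* utf8Codec;
};

Then the writer just needs to be told to encode through it:

#include <QFile>
#include <QXmlStreamWriter>

QFile file("out.gpx");
file.open(QIODevice::WriteOnly);
QXmlStreamWriter writer(&file);
writer.setCodec(new XmlTextCodec); // registered with, and owned by, Qt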

The new benchmark?

18.1 user, 4.7 system, 22.8 total

That's 5% slower than our baseline (without any bad-character filtering), and the filtering overhead is roughly 10x smaller than with QRegExp. Win win?
