@cemeyer
Created September 15, 2013 16:02

Hack: Improving XML bad-character filtering performance by 10x (Qt4)

Backstory: Some Unicode code points are invalid in XML; specifically, we want to filter the code points (hex) 0-8, b-c, and e-1f. Qt4's QXmlStreamWriter makes no attempt to filter these code points in the documents it writes.

So, here's our baseline, with the same benchmark we'll use later (big.gpx is a giant 142 MB file, and GPX is an XML-based format):

./gpsbabel -i gpx -f big.gpx [-o gpx -F /dev/null] (x5)

17.2 user, 4.8 system, 22.1 total

In GPSBabel, we already subclass QXmlStreamWriter to add other convenience methods. So our initial attempt to filter bad characters was simply to interpose on all methods that take potentially unfiltered inputs (writeAttribute(), writeTextElement(), ...) and apply a QString::replace() substitution with a QRegExp.
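
Concretely, the interposed wrappers looked something like this (a sketch; the exact GPSBabel method signatures and class layout may differ). Note that QXmlStreamWriter's write methods aren't virtual, so this works by hiding the base-class methods and calling through the subclass type:

#include <QRegExp>
#include <QString>
#include <QXmlStreamWriter>

class XmlStreamWriter : public QXmlStreamWriter
{
public:
  XmlStreamWriter(QIODevice* device) : QXmlStreamWriter(device) {}

  // Hides (does not override) the base method; filter, then delegate.
  void writeTextElement(const QString& name, const QString& text)
  {
    QString filtered = text;
    // Replace the XML-invalid code points 0x0-0x8, 0xb, 0xc, 0xe-0x1f.
    filtered.replace(QRegExp("[\\x0000-\\x0008\\x000b\\x000c\\x000e-\\x001f]"), " ");
    QXmlStreamWriter::writeTextElement(name, filtered);
  }
};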

It turns out QRegExp is really slow. Slower than PCRE. In my benchmarks, QRegExp costs us an extra 55% user CPU time:

26.7 user, 4.7 system, 31.5 total

Qt5 adds QRegularExpression, built on PCRE, which is moderately faster. Unfortunately, I don't have Qt5 locally to benchmark. Nor do many people -- so we need something reasonable on Qt4. For example, RHEL/CentOS 6 is stuck on Qt 4.6.
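
For reference, a Qt5 version of the same substitution would look roughly like this (unbenchmarked, as noted above; the helper name is illustrative):

#include <QRegularExpression>
#include <QString>

// Same character class as above, in PCRE's \x{hhhh} syntax.
static QString filterBadChars(QString text)
{
  static const QRegularExpression badChars(
      "[\\x{0000}-\\x{0008}\\x{000b}\\x{000c}\\x{000e}-\\x{001f}]");
  text.replace(badChars, " ");
  return text;
}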

So, here's the hack: We create our own QTextCodec (responsible for translating to and from the internal UTF-16 QString encoding) which mostly defers to the built-in Qt UTF-8 text codec, but performs the additional step of stripping out invalid code points.

QByteArray XmlTextCodec::convertFromUnicode(const QChar* c, int n, QTextCodec::ConverterState* s) const
{
  // Encode with the stock UTF-8 codec, then scrub the result in place.
  QByteArray r = utf8Codec->fromUnicode(c, n, s);
  char* data = r.data();
  for (int i = 0; i < r.size(); i++) {
    // Replace XML-invalid control bytes with spaces. Bytes >= 0x80 (the
    // lead/continuation bytes of multi-byte UTF-8 sequences) fail all three
    // range checks whether char is signed or unsigned, so they pass through
    // untouched.
    if ((0x00 <= data[i] && data[i] <= 0x08) ||
        (0x0b <= data[i] && data[i] <= 0x0c) ||
        (0x0e <= data[i] && data[i] <= 0x1f)) {
      data[i] = ' ';
    }
  }
  return r;
}

This is valid for two reasons inherent to UTF-8: any code point in [0, 127] (US-ASCII) is encoded as the same single byte in UTF-8; and every byte of a multi-byte UTF-8 sequence has the high-order bit, 0x80, set. So, we can't clobber any valid UTF-8 sequence by accident.
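
For completeness, here is a sketch of how the rest of the codec subclass and its hookup might look. Everything beyond convertFromUnicode() above (the constructor, name, and MIB number) is assumed rather than taken from the actual GPSBabel source:

#include <QTextCodec>

class XmlTextCodec : public QTextCodec
{
public:
  // Constructing a QTextCodec registers it with Qt, which owns it from
  // then on; don't delete it yourself.
  XmlTextCodec() : utf8Codec(QTextCodec::codecForName("UTF-8")) {}

  QByteArray name() const { return "XML-UTF-8"; } // any otherwise-unused name
  int mibEnum() const { return -1; }              // no official MIB number

protected:
  // Decoding needs no filtering: defer straight to UTF-8.
  QString convertToUnicode(const char* c, int n, ConverterState* s) const
  {
    return utf8Codec->toUnicode(c, n, s);
  }

  // The filtering encoder shown above.
  QByteArray convertFromUnicode(const QChar* c, int n, ConverterState* s) const;

  QTextCodec* utf8Codec;
};

Then the writer just needs to be told to encode through it:

#include <QFile>
#include <QXmlStreamWriter>

QFile file("out.gpx");
file.open(QIODevice::WriteOnly);
QXmlStreamWriter writer(&file);
writer.setCodec(new XmlTextCodec); // registered with, and owned by, Qt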

The new benchmark?

18.1 user, 4.7 system, 22.8 total

That's 5% slower than our baseline (without any bad-character filtering), and the filtering overhead is roughly 10x smaller than with QRegExp. Win win?
