@gerrard00
Created September 13, 2017 05:57
Xml Compression tests
using System;
using System.Diagnostics;
using System.IO;
using System.Xml;

namespace Test
{
    class Test
    {
        static void Main()
        {
            // Baseline (loadxmlstream): time loading the raw, uncompressed XML straight into an XmlDocument.
            var stopWatch = new Stopwatch();
            stopWatch.Start();

            using (var inputStream = File.OpenRead(@"c:\junk\supp2017.xml"))
            {
                var doc = new XmlDocument();
                doc.Load(inputStream);
            }

            stopWatch.Stop();
            Console.WriteLine("Loading xml {0}", stopWatch.Elapsed.TotalMilliseconds);
            Console.WriteLine("Done");
        }
    }
}

using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.Xml;

namespace Test
{
    class Test
    {
        static void Main()
        {
            // Second program (loadxmldeflated): time decompressing the DeflateStream-compressed
            // copy of the same XML and loading it into an XmlDocument.
            var stopWatch = new Stopwatch();
            stopWatch.Start();

            using (var inputStream = File.OpenRead(@"c:\junk\supp2017.xml.cmp"))
            using (var decompressingStream = new DeflateStream(inputStream, CompressionMode.Decompress))
            {
                var doc = new XmlDocument();
                doc.Load(decompressingStream);
            }

            stopWatch.Stop();
            Console.WriteLine("Loading xml {0}", stopWatch.Elapsed.TotalMilliseconds);
            Console.WriteLine("Done");
        }
    }
}
@gerrard00
Author

I got downvoted on Stack Overflow, which makes me sad:

https://stackoverflow.com/questions/14684412/how-to-compress-the-xml-file-at-the-time-of-writing/15907546

Well, there's no crying in baseball. Let's look at numbers. I used this sample xml file:

ftp://nlmpubs.nlm.nih.gov/online/mesh/MESH_FILES/xmlmesh/supp2017.xml

Windows Explorer shows the file as 572K on disk. I then compressed the file using the C# DeflateStream class:

using (FileStream originalFileStream = File.OpenRead(@"c:\junk\supp2017.xml"))
{
    using (FileStream compressedFileStream = File.Create(@"c:\junk\supp2017.xml.cmp"))
    {
        using (DeflateStream compressionStream = new DeflateStream(compressedFileStream, CompressionMode.Compress))
        {
            originalFileStream.CopyTo(compressionStream);
        }
    }
}

The compressed file was only 44K on disk.

I tested loading the raw file into an XmlDocument and loading the compressed file into an XmlDocument.

Here are a few example runs:

C:\junk>loadxmldeflated.exe
Loading xml 20905.3347
Done

C:\junk>loadxmlstream.exe
Loading xml 20322.838
Done

C:\junk>loadxmlstream.exe
Loading xml 22035.5008
Done

C:\junk>loadxmldeflated.exe
Loading xml 21247.6917
Done

I ran both implementations a few times and didn't see any consistent performance increase; in fact, the implementation with compression was sometimes slower. This isn't a scientific test by any stretch, but it calls into question the downvoter's contention that the in-memory work of building the massive object graph would be dwarfed by the cost of the disk I/O.

The reality is that loading a massive XML file into an XmlDocument is just a bad idea. I recommend using a streaming API instead of trying to get performance increases with compression, which was my original answer.
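
A minimal sketch of that streaming approach with XmlReader, using the same file path; counting elements here just stands in for whatever processing is actually needed:

using System;
using System.Diagnostics;
using System.Xml;

namespace Test
{
    class StreamingTest
    {
        static void Main()
        {
            var stopWatch = new Stopwatch();
            stopWatch.Start();

            var count = 0;
            var settings = new XmlReaderSettings { IgnoreWhitespace = true };

            // Stream through the document node by node instead of building the whole object graph.
            using (var reader = XmlReader.Create(@"c:\junk\supp2017.xml", settings))
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element)
                    {
                        count++;
                    }
                }
            }

            stopWatch.Stop();
            Console.WriteLine("Elements: {0}, elapsed {1}ms", count, stopWatch.Elapsed.TotalMilliseconds);
        }
    }
}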

@robert4

robert4 commented Sep 13, 2017

You don't see the difference because the file already sits in the OS's file system cache: you've read it at least once since the last system restart. Yes, in such tests the non-compressing version must appear faster because it doesn't have to bother with decompression -- the OS serves it the uncompressed file from memory (the file system cache). (By the way, this shows the true cost of decompression: it will take the same amount of time when the file really is read from disk.)

A true test would be to read files that the OS has not cached, which is somewhat tricky to ensure: for example, 1) dismount the physical disk that contains the test .xml and re-mount it between every test (chkdsk /x can achieve this), and/or 2) process more files than fit into the file system cache, and/or 3) fill the file system cache with other files, e.g. run MD5 or SHA1 calculations on several huge files between the tests.
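
A rough sketch of option 3, assuming a couple of large placeholder files to hash between runs:

using System;
using System.IO;
using System.Security.Cryptography;

class CacheEvictor
{
    static void Main()
    {
        // Placeholder paths: any files large enough to displace the test XML from the cache will do.
        string[] bigFiles = { @"c:\junk\big1.iso", @"c:\junk\big2.iso" };

        using (var sha1 = SHA1.Create())
        {
            foreach (var path in bigFiles)
            {
                using (var stream = File.OpenRead(path))
                {
                    // Hashing forces the whole file to be read, filling the file system cache with its contents.
                    var hash = sha1.ComputeHash(stream);
                    Console.WriteLine("{0}: {1}", path, BitConverter.ToString(hash));
                }
            }
        }
    }
}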
Testing with one huge XML file (larger than the file system cache, i.e. roughly all the available RAM) is not a good idea either, because then XmlDocument.Load() won't have enough RAM and heavy paging will occur. Testing right after a system restart may also be misleading, because Windows has services that prefetch recently/frequently used files into memory right after restart so they are cached by the time they are next used (Superfetch and its ilk).
If you're running the tests in a VM, things may be even more complicated, because the host OS also has a file system cache and I'm unsure how that interacts with the guest VM's virtualized disk accesses. However, in real-life scenarios, when a user wants to process a GB-sized XML file with your software, that file is usually not in his/her file system cache, because he/she hasn't worked with it yet: it's your program that will ask the OS to load the file, and the user will attribute the slowness to your program.

@gerrard00
Author

Please run a test and demonstrate the difference in performance. I think that would be more valuable than conjecture. At this point I still think that loading that entire .NET object graph, with hundreds of thousands of objects taking up gigabytes of memory, is the more expensive part of the process.

Disk I/O is much slower than memory I/O; I don't think anyone disputes that basic knowledge. The question is whether that difference is larger than all the other work you do once the data is loaded into memory, and I don't think you can make that assertion. Loading gigabytes' worth of data into hundreds of thousands of .NET objects is slow and expensive.

If the issue were disk caching, wouldn't the first run of the tests have been much slower than the subsequent runs?

@robert4

robert4 commented Sep 13, 2017

Indeed, the first run of the tests must have been much(?) slower than the subsequent runs. The question is whether that “much” is truly much or not so much, but the first run should have been clearly slower to some degree. Since that was not the case, it indicates that disk caching affected the tests.

I have downloaded the supp2017.xml file you used. (That's 572M, not 572K.) Then I used a different approach: I read the whole file into memory and measured the time of XmlDocument.Load() alone on it:

byte[] data = File.ReadAllBytes(@"G:\supp2017.xml");
var m = new MemoryStream(data);
var stopWatch = new System.Diagnostics.Stopwatch();
stopWatch.Start();
var doc = new XmlDocument();
doc.Load(m);
stopWatch.Stop();
Console.WriteLine("XmlDocument.Load(): {0}ms", stopWatch.Elapsed.TotalMilliseconds);
Console.WriteLine("Done");

This reported 9.1-9.3 seconds on my computer. This is the cost of building the object graph in the given .NET environment, work done entirely in memory. Now I should add the time it takes to load the file, with or without compression. In fact the question is not the sum, but the ratio of its two components to each other (graph building vs. loading). Since it's difficult to measure the file-loading time precisely (due to the effect of the file system cache), I can use estimates based on everyday practice. Actually, we don't need the precise amount of time it takes to load the file; we only want to know whether it is much more or much less than 9s.
Clearly that depends on the speed of the media the XML file is read from. I usually see about 100M/s read speed with my HDD, so I estimate the cold load of the 572M uncompressed file at approx. 6s. When the file is compressed to 44M, it would load in 0.5s, with about 3s of decompression added on top (I measured this with a program similar to the one above, which read the compressed file into a byte[] array and then measured the time of decompression + XmlDocument.Load() together).
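
A sketch of that measurement, along the lines of the snippet above (the compressed file path is assumed):

byte[] compressed = File.ReadAllBytes(@"G:\supp2017.xml.cmp"); // assumed path for the 44M compressed copy
var stopWatch = new System.Diagnostics.Stopwatch();
stopWatch.Start();
using (var m = new MemoryStream(compressed))
using (var decompressingStream = new DeflateStream(m, CompressionMode.Decompress))
{
    // Time decompression and XmlDocument.Load() together, with no disk I/O inside the measurement.
    var doc = new XmlDocument();
    doc.Load(decompressingStream);
}
stopWatch.Stop();
Console.WriteLine("Decompress + XmlDocument.Load(): {0}ms", stopWatch.Elapsed.TotalMilliseconds);
Console.WriteLine("Done");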

You are right that this “approx. 6s” is not much larger than, and is in fact smaller than, the 9s cost of XmlDocument.Load(). Thus the cost of this in-memory work is not dwarfed by the disk I/O, and you are right that the in-memory work is the more expensive part of the process.
Yet the lack of compression makes the whole process 20% slower (15s vs. 12.5s), and this gets more and more pronounced as the speed of the media decreases:

| Media | Speed | Loading uncompressed | Loading compressed and decompressing | Speed loss if not compressed |
|---|---|---|---|---|
| internal hdd | 100M/s | 6s | 3.5s | ×1.2 = (9+6)/(9+3.5) |
| external hdd | 30M/s | 19s | 4.5s | ×2.1 ≈ (9+19)/(9+4.5) |
| pendrive or NAS | 10M/s | 57s | 7.5s | ×4 ≈ (9+57)/(9+7.5) |

@robert4

robert4 commented Sep 14, 2017

You convinced me that XmlDocument.Load() is slow enough to be on par with the extremely slow disk I/O (especially for GB-sized documents), but it still isn't slow enough to render compression worthless as you stated. So your reasoning is not completely wrong – I'm willing to remove the downvote from your answer. But this can be done only after editing it, because 2+ days have elapsed. Thank you for the constructive discussion.
