A common problem in software development and research is the "do-something"-"save data" loop. Often we are saving structured data over and over again, and this document looks at the fastest way to do this.
One option is to save each new piece of data to its own file. I will not consider this because:
- How do we combine these files later? This just kicks the can down the road.
- This can create potentially millions of tiny files on your computer and quickly overwhelm your filesystem.
If I knew the expected size of the files ahead of time, this would be a trivial problem. I'm going to assume that we don't know how big our data is going to be, and that we need something that works for small datasets but can also scale with zero overhead to extremely large datasets.
The straightforward approach, and what a beginner might do, is to load the existing file, append the new data in memory, and save everything again. It looks like this:
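clear all
% Seed the random number generator so the test data is reproducible
RandStream.setGlobalStream(RandStream('mt19937ar','Seed',1984));
a = randn(1e7,1); % initial data
b = randn(1e5,1); % new data to add later
% Initial save
tic
save('test_base.mat','a','-nocompression')
t = toc;
disp(['Saving a uncompressed .mat took ' mat2str(t,2) ' seconds'])
% Add new data: load the whole file, grow the array in memory, save it all again
tic
load('test_base.mat','a')
a = [a;b];
save('test_base.mat','a','-nocompression')
t = toc;
disp(['Adding to a uncompressed .mat took ' mat2str(t,2) ' seconds'])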
and this gives us:
Saving a uncompressed .mat took 0.1 seconds
Adding to a uncompressed .mat took 0.25 seconds
There are several disadvantages:
- We need to load the data into memory
- We need to resize the data in memory
- We need to overwrite the data, which is slow. This step is critical since we will be doing this over and over.
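An alternative is to save the data as text using the -ascii flag, which lets us add new data with -append instead of rewriting the whole file:
% Seed the random number generator so the test data is reproducible
RandStream.setGlobalStream(RandStream('mt19937ar','Seed',1984));
a = randn(1e7,1); % initial data
b = randn(1e5,1); % new data to add later
% initial save
tic
save('test_ascii.mat','a','-ascii')
t = toc;
disp(['Saving to ASCII took ' mat2str(t,2) ' seconds'])
% append
tic
save('test_ascii.mat','b','-append','-ascii')
t = toc;
disp(['Adding to ASCII took ' mat2str(t,2) ' seconds'])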
and this results in:
Saving to ASCII took 3.5 seconds
Adding to ASCII took 0.04 seconds
The initial save is very slow, but subsequent saves are MUCH faster, since we're simply adding lines to a text file.
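One thing worth checking with the text approach is what comes back when the file is read. The sketch below is only an illustration under stated assumptions: the temporary file name and the small vectors are made up for the example and are not part of the benchmark above. It saves a short vector with -ascii, appends a second one, and reads everything back with load.

```matlab
% Illustrative sketch: the file name and the small vectors are hypothetical.
fname = [tempname '.txt'];
a = [pi; 2; 3];
b = [4; 5];

save(fname,'a','-ascii')             % initial save as text
save(fname,'b','-append','-ascii')   % add rows to the end of the text file

% The text file stores no variable names, so load returns one numeric matrix
c = load(fname,'-ascii');

% Largest difference between what was saved and what was read back
err = max(abs(c - [a;b]));
disp(['Read back ' mat2str(numel(c)) ' values, max error ' mat2str(err,2)])
delete(fname);
```

The two saves come back as a single column, so the appended data is simply concatenated. By default -ascii writes 8 digits of precision, so the values read back are close to, but not exactly, the originals. Adding the -double flag writes 16 digits, at the cost of a larger file.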
We can use low-level functions (fwrite and fopen in MATLAB) to write the data directly to disk as follows:
% Initial save: 'w' creates the file, discarding any existing contents
tic
f = fopen('low_level.bin','w');
fwrite(f,a,'double');
fclose(f);
t = toc;
disp(['Saving a binary took ' mat2str(t,2) ' seconds'])
% Append: 'a' opens the file and writes at the end of it
tic
f = fopen('low_level.bin','a');
fwrite(f,b,'double');
fclose(f);
t = toc;
disp(['Adding to binary took ' mat2str(t,2) ' seconds'])
and this is the fastest:
Saving a binary took 0.11 seconds
Adding to binary took 0.0015 seconds