@sg-s
Last active January 28, 2022 13:32

A common problem in software development and research is the "do-something", "save-data" loop: we save structured data over and over again. This document compares a few ways of doing this in MATLAB and looks for the fastest one.

Solutions I will not consider

Writing each chunk of data to its own file

I will not consider this because:

  1. How do we combine these files later? This just kicks the can down the road.
  2. This can create potentially millions of tiny files on your computer, which can quickly overwhelm your filesystem.

Pre-allocating data/space in files

If I knew the expected size of the files ahead of time, this would be a trivial problem. I'm going to assume that we don't know how big our data is going to be, and that we need something that works for small datasets but also scales to extremely large datasets without extra overhead.

Possible solutions

Simple load-and-save

This is straightforward, and it is what a beginner might do. It looks like this:

clear all

RandStream.setGlobalStream(RandStream('mt19937ar','Seed',1984)); 
a = randn(1e7,1);
b = randn(1e5,1);

tic
save('test_base.mat','a','-nocompression')
t = toc;
disp(['Saving an uncompressed .mat took ' mat2str(t,2) ' seconds'])


% load everything back, append b in memory, and overwrite the file
tic
load('test_base.mat','a')
a = [a;b];
save('test_base.mat','a','-nocompression')
t = toc;
disp(['Adding to an uncompressed .mat took ' mat2str(t,2) ' seconds'])

and this gives us:

Saving an uncompressed .mat took 0.1 seconds
Adding to an uncompressed .mat took 0.25 seconds

There are several disadvantages:

  1. We need to load all the existing data into memory.
  2. We need to resize the data in memory.
  3. We need to rewrite the entire file on every save, which is slow. This cost is critical since we will be doing this over and over.
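To make the third point concrete, here is a minimal sketch that repeats the load-append-save step in a loop. The sizes, the number of appends, and the temporary file name are illustrative assumptions, not the benchmark above. Every iteration reads and rewrites the whole array, so the time per append would be expected to grow as the file grows.

```matlab
% Sketch: repeat the load-append-save cycle several times.
% Sizes, loop count and file name are assumptions made for illustration.
fname = [tempname '.mat'];
a = randn(1e5,1);
save(fname,'a','-nocompression')

n_appends = 5;
t = zeros(n_appends,1);
for i = 1:n_appends
    b = randn(1e4,1);                 % new chunk from the "do-something" step
    tic
    load(fname,'a')                   % read everything saved so far
    a = [a; b];                       % grow the array in memory
    save(fname,'a','-nocompression')  % rewrite the whole file
    t(i) = toc;                       % time for this append
end

info = whos('-file',fname,'a');       % inspect the file without loading it
n_saved = info.size(1);               % number of rows now stored on disk
delete(fname)
```

Plotting t against the iteration number is one way to check how the cost scales on your own machine.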

Saving as ASCII text files

RandStream.setGlobalStream(RandStream('mt19937ar','Seed',1984)); 
a = randn(1e7,1);
b = randn(1e5,1);

tic
save('test_ascii.mat','a','-ascii')
t = toc;
disp(['Saving to ASCII took ' mat2str(t,2) ' seconds'])


% append 
tic
save('test_ascii.mat','b','-append','-ascii')
t = toc;
disp(['Adding to ASCII took ' mat2str(t,2) ' seconds'])

and this results in:

Saving to ASCII took 3.5 seconds
Adding to ASCII took 0.04 seconds

The initial save is very slow, but subsequent saves are much faster, since we're simply adding lines to the end of a text file.
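Appending is only useful if the data can be read back, so here is a small sketch of the round trip. It assumes small arrays and a temporary .txt file name of my choosing. Note that plain -ascii stores only 8 digits per value, so the data that comes back is close to, but generally not identical to, what was saved; adding the -double flag to save stores 16 digits at the cost of a larger file.

```matlab
% Sketch: append to an ASCII file, then read everything back.
% Array sizes and file name are assumptions made for illustration.
fname = [tempname '.txt'];
a = randn(1000,1);
b = randn(100,1);

save(fname,'a','-ascii')             % initial write
save(fname,'b','-append','-ascii')   % adds rows to the end of the file

x = load(fname,'-ascii');            % comes back as a single unnamed matrix
max_err = max(abs(x - [a; b]));      % rounding error from the 8-digit format
delete(fname)
```

The variable names are not stored in the file, so the reader has to know what the rows mean.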

Using low-level functions to write a binary data stream

We can use low-level functions (fwrite and fopen in MATLAB) to write the data directly to disk as follows:

tic
f = fopen('low_level.bin','w');
fwrite(f,a,'double');
fclose(f);
t = toc;
disp(['Saving a binary took ' mat2str(t,2) ' seconds'])


tic
f = fopen('low_level.bin','a');
fwrite(f,b,'double');
fclose(f);
t = toc;
disp(['Adding to binary took ' mat2str(t,2) ' seconds'])

and this gives by far the fastest append:

Saving a binary took 0.11 seconds
Adding to binary took 0.0015 seconds
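The trade-off is that a raw binary file carries no metadata: the reader has to know the data type and shape. A minimal sketch of the full loop, assuming double-precision column vectors and a temporary file name of my choosing, appends a few chunks and then reads them back with fread:

```matlab
% Sketch: append chunks to a raw binary file, then read everything back.
% Array sizes and file name are assumptions made for illustration.
fname = [tempname '.bin'];
a = randn(1000,1);
b = randn(100,1);

chunks = {a, b};
for i = 1:numel(chunks)
    f = fopen(fname,'a');             % 'a' creates the file if needed, then appends
    assert(f ~= -1,'could not open file for appending')
    fwrite(f,chunks{i},'double');
    fclose(f);
end

% The file holds only raw 8-byte values, so we supply the type ourselves
d = dir(fname);
n = d.bytes/8;                        % number of doubles stored
f = fopen(fname,'r');
x = fread(f,[n 1],'double');          % read all values as one column
fclose(f);
delete(fname)
```

Unlike the ASCII route, the values read back here are bit-for-bit the values written, as long as the file is read on a machine with the same byte order.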