aaronwolen/scale-data-storage-proposal.md

## scale-data-storage-proposal.md

      
Our starting point is raw counts in a sparse matrix, smat (4 genes, 5 cells):
   c1 c2 c3 c4 c5
g1  8  .  .  .  5
g2  .  5  7  .  .
g3  1  .  .  5  .
g4  8  .  .  7  .

As part of our analysis we center/scale the data, resulting in a dense matrix, dmat:
            c1         c2         c3         c4         c5
g1  1.45363114 -0.6998965 -0.6998965 -0.6998965  0.6460583
g2 -0.71395694  0.7734534  1.3684175 -0.7139569 -0.7139569
g3 -0.09225312 -0.5535187 -0.5535187  1.7528093 -0.5535187
g4  1.21267813 -0.7276069 -0.7276069  0.9701425 -0.7276069
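The values in dmat are consistent with centering each gene on its mean across all 5 cells and dividing by the sample standard deviation (the n - 1 denominator), with empty cells counted as 0. A minimal sketch of that step, assuming this is the scaling used; the dict layout and the `scale_row` helper are only for illustration:

```python
from statistics import mean, stdev

# Raw counts from the example; 0 stands in for the empty (".") cells of smat.
smat = {
    "g1": [8, 0, 0, 0, 5],
    "g2": [0, 5, 7, 0, 0],
    "g3": [1, 0, 0, 5, 0],
    "g4": [8, 0, 0, 7, 0],
}

def scale_row(counts):
    """Center one gene's counts on their mean and divide by the sample SD."""
    mu = mean(counts)
    sd = stdev(counts)  # sample standard deviation (n - 1 denominator)
    return [(x - mu) / sd for x in counts]

# Dense result: every cell now holds a value, including the former zeros.
dmat = {gene: scale_row(counts) for gene, counts in smat.items()}

for gene, row in dmat.items():
    print(gene, [round(v, 7) for v in row])
```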

Instead of storing all 20 values in dmat, I’m proposing we only store:
            c1        c2       c3        c4        c5
g1  1.45363114         .        .         . 0.6460583
g2           . 0.7734534 1.368417         .         .
g3 -0.09225312         .        . 1.7528093         .
g4  1.21267813         .        . 0.9701425         .

which corresponds to the 8 non-empty coordinates of smat.
Then, in the metadata, we store the gene-specific (per-row) offsets that were applied to the previously empty cells. Concretely, g1 in smat contained 3 empty values. In dmat those 3 cells all hold the same value, -0.6998965, because every zero count in a row is shifted and scaled identically. For g2 the offset is -0.71395694, and so on.
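To make the split concrete, here is a rough sketch that derives both pieces from the non-empty coordinates of smat: the per-gene offsets and the scaled values we would actually store. It assumes the same per-gene mean and sample SD scaling as above; the names `counts`, `row_stats`, `offsets`, and `stored` are made up for this example.

```python
from statistics import mean, stdev

genes = ["g1", "g2", "g3", "g4"]
cells = ["c1", "c2", "c3", "c4", "c5"]

# Non-empty coordinates of smat as {(gene, cell): count}.
counts = {
    ("g1", "c1"): 8, ("g1", "c5"): 5,
    ("g2", "c2"): 5, ("g2", "c3"): 7,
    ("g3", "c1"): 1, ("g3", "c4"): 5,
    ("g4", "c1"): 8, ("g4", "c4"): 7,
}

def row_stats(gene):
    """Mean and sample SD of one gene across all cells, empty cells counted as 0."""
    full = [counts.get((gene, c), 0) for c in cells]
    return mean(full), stdev(full)

stats = {g: row_stats(g) for g in genes}

# Offset: the value an empty (zero) cell takes after scaling, (0 - mean) / sd.
offsets = {g: -mu / sd for g, (mu, sd) in stats.items()}

# Stored values: scaled entries at the original non-empty coordinates only.
stored = {
    (g, c): (x - stats[g][0]) / stats[g][1]
    for (g, c), x in counts.items()
}

print(len(stored), "stored values,", len(offsets), "offsets")
```

For this example that is 8 stored values plus 4 offsets, instead of the 20 values in the dense matrix.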
On disk dmat would look something like this:
[array]
      i    j           x
1    g1   c1  1.45363114
3    g3   c1 -0.09225312
4    g4   c1  1.21267813
6    g2   c2  0.77345335
10   g2   c3  1.36841747
15   g3   c4  1.75280930
16   g4   c4  0.97014250
17   g1   c5  0.64605828

[metadata]
offsets: -0.6998965, -0.71395694, -0.5535187, -0.7276069
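The unlabeled first column in the array listing appears to be the 1-based, column-major linear index of each entry in the 4 x 5 matrix. That reading is my assumption, but it matches all 8 rows. A small sketch of the mapping (`linear_index` is a hypothetical helper):

```python
genes = ["g1", "g2", "g3", "g4"]
cells = ["c1", "c2", "c3", "c4", "c5"]

def linear_index(gene, cell):
    """1-based column-major position of (gene, cell) in the genes x cells matrix."""
    i = genes.index(gene) + 1  # 1-based row number
    j = cells.index(cell) + 1  # 1-based column number
    return (j - 1) * len(genes) + i

# Coordinates in the order they appear in the [array] listing.
coords = [
    ("g1", "c1"), ("g3", "c1"), ("g4", "c1"), ("g2", "c2"),
    ("g2", "c3"), ("g3", "c4"), ("g4", "c4"), ("g1", "c5"),
]

indices = [linear_index(g, c) for g, c in coords]
print(indices)  # [1, 3, 4, 6, 10, 15, 16, 17]
```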

Using this array and metadata, we could reconstruct the original matrix, dmat, without any loss of information and without making any assumptions about how the data was generated.
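Reconstruction could be sketched roughly as follows: fill each row with its offset, then overwrite the stored coordinates. The values are copied from the listing above; the `reconstruct` function and its argument layout are illustrative rather than a proposed API.

```python
genes = ["g1", "g2", "g3", "g4"]
cells = ["c1", "c2", "c3", "c4", "c5"]

# [array]: (gene, cell, value) triples as they would be read from disk.
array = [
    ("g1", "c1", 1.45363114),
    ("g3", "c1", -0.09225312),
    ("g4", "c1", 1.21267813),
    ("g2", "c2", 0.77345335),
    ("g2", "c3", 1.36841747),
    ("g3", "c4", 1.75280930),
    ("g4", "c4", 0.97014250),
    ("g1", "c5", 0.64605828),
]

# [metadata]: one offset per gene, in row order.
offsets = [-0.6998965, -0.71395694, -0.5535187, -0.7276069]

def reconstruct(array, offsets, genes, cells):
    """Rebuild the dense matrix as {gene: [value per cell]}."""
    # Start with every cell in a row set to that gene's offset.
    dense = {g: [off] * len(cells) for g, off in zip(genes, offsets)}
    col = {c: j for j, c in enumerate(cells)}
    # Overwrite the coordinates that were non-empty in smat.
    for g, c, x in array:
        dense[g][col[c]] = x
    return dense

dmat = reconstruct(array, offsets, genes, cells)
for gene in genes:
    print(gene, dmat[gene])
```

Nothing here needs to know how the offsets were computed, only which coordinates were stored and which offset belongs to each row.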