@aaronwolen
Created June 23, 2022 21:07

Our starting point is raw counts in a sparse matrix, smat (4 genes, 5 cells):

   c1 c2 c3 c4 c5
g1  8  .  .  .  5
g2  .  5  7  .  .
g3  1  .  .  5  .
g4  8  .  .  7  .

As part of our analysis we center/scale the data, resulting in a dense matrix, dmat:

            c1         c2         c3         c4         c5
g1  1.45363114 -0.6998965 -0.6998965 -0.6998965  0.6460583
g2 -0.71395694  0.7734534  1.3684175 -0.7139569 -0.7139569
g3 -0.09225312 -0.5535187 -0.5535187  1.7528093 -0.5535187
g4  1.21267813 -0.7276069 -0.7276069  0.9701425 -0.7276069
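The values above can be reproduced with a short sketch, assuming the centering/scaling is done per gene (row) with the sample standard deviation (n - 1 denominator). That assumption is not stated in the note, but it matches the printed numbers; the helper name `scale_rows` is made up for illustration.

```python
from statistics import mean, stdev

# Raw counts from smat, with empty cells written out as zeros.
genes = ["g1", "g2", "g3", "g4"]
smat = [
    [8, 0, 0, 0, 5],
    [0, 5, 7, 0, 0],
    [1, 0, 0, 5, 0],
    [8, 0, 0, 7, 0],
]

def scale_rows(matrix):
    """Center each row on its mean and divide by its sample standard deviation."""
    scaled = []
    for row in matrix:
        mu = mean(row)
        sd = stdev(row)  # assumption: n - 1 denominator
        scaled.append([(x - mu) / sd for x in row])
    return scaled

dmat = scale_rows(smat)
for gene, row in zip(genes, dmat):
    print(gene, [round(v, 7) for v in row])
```

Every zero in a row maps to the same value, -mean/sd for that row, which is why a single number per gene can describe all of its previously empty cells.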

Instead of storing all 20 values in dmat, I’m proposing we only store:

            c1        c2       c3        c4        c5
g1  1.45363114         .        .         . 0.6460583
g2           . 0.7734534 1.368417         .         .
g3 -0.09225312         .        . 1.7528093         .
g4  1.21267813         .        . 0.9701425         .

which corresponds to the 8 non-empty coordinates of smat.

Then in the metadata we store the gene/row-specific offsets, that is, the value each previously empty cell takes after scaling. Concretely, g1 in smat contained 3 empty cells. In dmat all 3 of those cells hold the same offset, -0.6998965. For g2 the offset is -0.71395694, and so on.
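As a rough sketch of that split, the snippet below takes smat's sparsity pattern and the printed dmat values and separates them into (gene, cell, value) triplets plus one offset per gene. The function name `split_dense` and the tolerance are illustrative assumptions, not part of any existing API; the tolerance only absorbs the rounding in the printed values.

```python
genes = ["g1", "g2", "g3", "g4"]
cells = ["c1", "c2", "c3", "c4", "c5"]

# Raw counts (smat); None marks an empty cell.
smat = [
    [8, None, None, None, 5],
    [None, 5, 7, None, None],
    [1, None, None, 5, None],
    [8, None, None, 7, None],
]

# Scaled values (dmat), copied from the printed matrix.
dmat = [
    [1.45363114, -0.6998965, -0.6998965, -0.6998965, 0.6460583],
    [-0.71395694, 0.7734534, 1.3684175, -0.7139569, -0.7139569],
    [-0.09225312, -0.5535187, -0.5535187, 1.7528093, -0.5535187],
    [1.21267813, -0.7276069, -0.7276069, 0.9701425, -0.7276069],
]

def split_dense(smat, dmat, genes, cells, tol=1e-6):
    """Return (triplets, offsets) for a dense matrix with smat's sparsity pattern."""
    offsets = []
    for raw_row, scaled_row in zip(smat, dmat):
        # Scaled values sitting at coordinates that were empty in smat.
        empties = [s for r, s in zip(raw_row, scaled_row) if r is None]
        # The scheme only works if all of them agree within a row.
        if any(abs(s - empties[0]) > tol for s in empties):
            raise ValueError("empty cells in a row do not share one value")
        offsets.append(empties[0] if empties else None)
    # Keep only the coordinates that were non-empty in smat (column-major order).
    triplets = [
        (genes[i], cells[j], dmat[i][j])
        for j in range(len(cells))
        for i in range(len(genes))
        if smat[i][j] is not None
    ]
    return triplets, offsets

triplets, offsets = split_dense(smat, dmat, genes, cells)
print(len(triplets), "stored values;", "offsets:", offsets)
```

The consistency check matters: if a transformation gave different values to the empty cells of one row, a single per-row offset could not represent them and the function raises instead of silently losing information.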

On disk, dmat would look something like this (the unlabeled first column is each cell's column-major position in the 4 x 5 matrix):

[array]
      i    j           x
1    g1   c1  1.45363114
3    g3   c1 -0.09225312
4    g4   c1  1.21267813
6    g2   c2  0.77345335
10   g2   c3  1.36841747
15   g3   c4  1.75280930
16   g4   c4  0.97014250
17   g1   c5  0.64605828

[metadata]
offsets: -0.6998965, -0.71395694, -0.5535187, -0.7276069

Using this array and metadata, we could reconstruct the original matrix, dmat, without any loss of information and without making any assumptions about how the data was generated.
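A minimal sketch of that reconstruction, assuming the offsets in the metadata are listed in gene order (g1 to g4) and that the array is read back as (i, j, x) triplets. The function name `reconstruct` is hypothetical.

```python
genes = ["g1", "g2", "g3", "g4"]
cells = ["c1", "c2", "c3", "c4", "c5"]

# [array]: the stored non-empty coordinates.
triplets = [
    ("g1", "c1", 1.45363114),
    ("g3", "c1", -0.09225312),
    ("g4", "c1", 1.21267813),
    ("g2", "c2", 0.77345335),
    ("g2", "c3", 1.36841747),
    ("g3", "c4", 1.75280930),
    ("g4", "c4", 0.97014250),
    ("g1", "c5", 0.64605828),
]

# [metadata]: one offset per gene, assumed to be in the same order as `genes`.
offsets = [-0.6998965, -0.71395694, -0.5535187, -0.7276069]

def reconstruct(triplets, offsets, genes, cells):
    """Rebuild the dense matrix: fill each row with its offset, then overlay stored values."""
    dense = {
        gene: {cell: offset for cell in cells}
        for gene, offset in zip(genes, offsets)
    }
    for gene, cell, value in triplets:
        dense[gene][cell] = value
    return [[dense[gene][cell] for cell in cells] for gene in genes]

dmat = reconstruct(triplets, offsets, genes, cells)
for gene, row in zip(genes, dmat):
    print(gene, row)
```

Nothing here depends on how the offsets were computed: they are read back as stored, which is what keeps the round trip free of assumptions about the original transformation.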
