Skip to content

Instantly share code, notes, and snippets.

@jamesc127
Last active September 7, 2022 01:02
Show Gist options
  • Save jamesc127/24e457fc52ea07e7f6841caba56f4d95 to your computer and use it in GitHub Desktop.
Save jamesc127/24e457fc52ea07e7f6841caba56f4d95 to your computer and use it in GitHub Desktop.
A Brief Explanation On Cassandra Tombstones

Here is an example of several different kinds of C* tombstones. For reference, they are:

  • Partition
  • Row
  • Cell
  • Range

Intro

The obvious culprit of tombstone creation is DELETE, but there are other - less obvious - sources of the tombstone. Let’s see exacly what happens on disk when a tombstone is created. It's funny to say that a tombstone is created... aren't we deleting things? Remember, everything in C* is a write! A DELETE operation writes another sstable entry with a newer timestamp than all the other entries... and the most recent timestamp wins!

I’ve created a DataStax Studio notebook that complements this gist. You should be able to download it and run it for yourself :-) nodetool and sstabledump commands need to be run from a terminal on the node(s) you're working with.

Let's Go!

Setup

CREATE KEYSPACE IF NOT EXISTS tombstones WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1};

CREATE TABLE IF NOT EXISTS tombstones.test (
  partitionkey int,
  clusteringcol text,
  string text,
  number int,
  PRIMARY KEY (partitionkey, clusteringcol)
) WITH CLUSTERING ORDER BY (clusteringcol DESC);

INSERT INTO tombstones.test (partitionkey, clusteringcol, string, number) VALUES (1,'clustering1','some text',123);
INSERT INTO tombstones.test (partitionkey, clusteringcol, string, number) VALUES (1,'clustering2','some other text',456);
INSERT INTO tombstones.test (partitionkey, clusteringcol, string, number) VALUES (1,'clustering3','what to write?',789);
INSERT INTO tombstones.test (partitionkey, clusteringcol, string, number) VALUES (1,'clustering4','to be, || ! to be.',0);
INSERT INTO tombstones.test (partitionkey, clusteringcol, string, number) VALUES (2,'clustering1','I am another partition',123);
INSERT INTO tombstones.test (partitionkey, clusteringcol, string, number) VALUES (3,'clustering1','yes, a third partition',456);
INSERT INTO tombstones.test (partitionkey, clusteringcol, string, number) VALUES (2,'clustering2','&&&&&',789);
INSERT INTO tombstones.test (partitionkey, clusteringcol, string, number) VALUES (4,'clustering1','I wonder what a tombstone looks like',0);

sstable Format

Now I have a keyspace, table, and some test data… let’s get to deleting! But hold on a second, what does this data look like on disk in its un-deleted form? For that, we go to sstabledump Take a look at the parameters for sstabledump here We run nodetool flush followed by sstabledump -d /path/to/*Data.db in order to get the following:

[1]@0 Row[info=[ts=1571267799116790] ]: clustering4 | [number=0 ts=1571267799116790], [string=to be, || ! to be. ts=1571267799116790]
[1]@61 Row[info=[ts=1571267799115484] ]: clustering3 | [number=789 ts=1571267799115484], [string=what to write? ts=1571267799115484]
[1]@100 Row[info=[ts=1571267799114163] ]: clustering2 | [number=456 ts=1571267799114163], [string=some other text ts=1571267799114163]
[1]@140 Row[info=[ts=1571267799112078] ]: clustering1 | [number=123 ts=1571267799112078], [string=some text ts=1571267799112078]
[2]@174 Row[info=[ts=1571267799123760] ]: clustering2 | [number=789 ts=1571267799123760], [string=&&&&& ts=1571267799123760]
[2]@222 Row[info=[ts=1571267799118192] ]: clustering1 | [number=123 ts=1571267799118192], [string=I am another partition ts=1571267799118192]
[4]@270 Row[info=[ts=1571267799125237] ]: clustering1 | [number=0 ts=1571267799125237], [string=I wonder what a tombstone looks like ts=1571267799125237]
[3]@350 Row[info=[ts=1571267799121652] ]: clustering1 | [number=456 ts=1571267799121652], [string=yes, a third partition ts=1571267799121652]

Row Tombstone

Let’s start off with a row tombstone. We just want to delete one particular row. Now, this could also be a partition delete if the PRIMARY KEY doesn’t contain any CLUSTERING COLUMN, but our PK does contain a clustering column so we have to be specific.

DELETE FROM tombstones.test WHERE partitionkey = 1 AND clusteringcol = 'clustering1'; //this is a row tombstone

Now we run nodetool flush again (to flush the memtable to sstable so we can get at it), followed by sstabledump /path/to/*Data.db. I’m running sstabledump without the -d switch this time so you can see the json format of the sstable. This will be a recurring theme: I'll use the -d switch when we look at the whole table, and the json format when we want to look closer. In order to save time and be pithy, assume that I run nodetool flush, sstabledump, nodetool compact <ks> <tbl>, then another sstabledump after every operation.

[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 37,
        "clustering" : [ "clustering1" ],
        "deletion_info" : { "marked_deleted" : "2019-10-16T23:22:25.951117Z", "local_delete_time" : "2019-10-16T23:22:25Z" },
        "cells" : [ ]
      }
    ]
  }
]

Notice that you don't see any of the other cell values here, like number or string. Those values have been removed from this entry, leaving only the empty cells array. Here's the whole table after compaction: everything merged back into a single entry. The json output above and was in its own sstable before we compacted.

[1]@0 Row[info=[ts=1571267799116790] ]: clustering4 | [number=0 ts=1571267799116790], [string=to be, || ! to be. ts=1571267799116790]
[1]@61 Row[info=[ts=1571267799115484] ]: clustering3 | [number=789 ts=1571267799115484], [string=what to write? ts=1571267799115484]
[1]@100 Row[info=[ts=1571267799114163] ]: clustering2 | [number=456 ts=1571267799114163], [string=some other text ts=1571267799114163]
[1]@140 Row[info=[ts=-9223372036854775808] del=deletedAt=1571268145951117, localDeletion=1571268145 ]: clustering1 | #row_tombstone
[2]@164 Row[info=[ts=1571267799123760] ]: clustering2 | [number=789 ts=1571267799123760], [string=&&&&& ts=1571267799123760]
[2]@212 Row[info=[ts=1571267799118192] ]: clustering1 | [number=123 ts=1571267799118192], [string=I am another partition ts=1571267799118192]
[4]@260 Row[info=[ts=1571267799125237] ]: clustering1 | [number=0 ts=1571267799125237], [string=I wonder what a tombstone looks like ts=1571267799125237]
[3]@340 Row[info=[ts=1571267799121652] ]: clustering1 | [number=456 ts=1571267799121652], [string=yes, a third partition ts=1571267799121652]

Partition Tombstone

OK, so that’s a row tombstone. Now let’s see what a partition tombstone looks like…

DELETE FROM tombstones.test WHERE partitionkey = 4; //this is a partition tombstone

Here's the sstable after the flush:

[
  {
    "partition" : {
      "key" : [ "4" ],
      "position" : 0,
      "deletion_info" : { "marked_deleted" : "2019-10-17T01:30:31.259830Z", "local_delete_time" : "2019-10-17T01:30:31Z" }
    },
    "rows" : [ ]
  }
]

Very much like the empty cells array in the row tombstone example, we now have an empty rows array. sstabledump again after compaction:

[1]@0 Row[info=[ts=1571267799116790] ]: clustering4 | [number=0 ts=1571267799116790], [string=to be, || ! to be. ts=1571267799116790]
[1]@61 Row[info=[ts=1571267799115484] ]: clustering3 | [number=789 ts=1571267799115484], [string=what to write? ts=1571267799115484]
[1]@100 Row[info=[ts=1571267799114163] ]: clustering2 | [number=456 ts=1571267799114163], [string=some other text ts=1571267799114163]
[1]@139 Row[info=[ts=-9223372036854775808] del=deletedAt=1571268145951117, localDeletion=1571268145 ]: clustering1 | #row_tombstone
[2]@163 Row[info=[ts=1571267799123760] ]: clustering2 | [number=789 ts=1571267799123760], [string=&&&&& ts=1571267799123760]
[2]@211 Row[info=[ts=1571267799118192] ]: clustering1 | [number=123 ts=1571267799118192], [string=I am another partition ts=1571267799118192]
[4]@259 deletedAt=1571275831259830, localDeletion=1571275831 #partition_tombstone
[3]@278 Row[info=[ts=1571267799121652] ]: clustering1 | [number=456 ts=1571267799121652], [string=yes, a third partition ts=1571267799121652]

Now what are we left with

SELECT * FROM tombstones.test;
partitionkey clusteringcol number string
1 clustering4 0 to be, or ! to be.
1 clustering3 789 what to write?
1 clustering2 456 some other text
2 clustering2 789 &&&&&
2 clustering1 123 I am another partition
3 clustering1 456 yes, a third partition

Cell Tombstone

Moving right along to a cell tombstone. This is where things get a little more interesting… Let’s say I want to remove a specific cell’s value. (For reference, a cell is the value of a column in a specific row)

DELETE string FROM tombstones.test WHERE partitionkey = 3 AND clusteringcol = 'clustering1'; //this is a cell tombstone

Notice that we had to be very specific again in our DELETE statement, since the PRIMARY KEY contains not only the PARTITION KEY but a CLUSTERING COLUMN. Here's the sstabledump where we can see that the pre-compaction sstable doesn't contain the cell value for number, just string, which has now been tombstone'd.

[
  {
    "partition" : {
      "key" : [ "3" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 38,
        "clustering" : [ "clustering1" ],
        "cells" : [
          { "name" : "string", "deletion_info" : { "local_delete_time" : "2019-10-17T01:39:37Z" },
            "tstamp" : "2019-10-17T01:39:37.357675Z"
          }
        ]
      }
    ]
  }
]

And after compaction, all the values are merged together again:

[1]@0 Row[info=[ts=1571267799116790] ]: clustering4 | [number=0 ts=1571267799116790], [string=to be, || ! to be. ts=1571267799116790]
[1]@61 Row[info=[ts=1571267799115484] ]: clustering3 | [number=789 ts=1571267799115484], [string=what to write? ts=1571267799115484]
[1]@100 Row[info=[ts=1571267799114163] ]: clustering2 | [number=456 ts=1571267799114163], [string=some other text ts=1571267799114163]
[1]@139 Row[info=[ts=-9223372036854775808] del=deletedAt=1571268145951117, localDeletion=1571268145 ]: clustering1 | #row_tombstone
[2]@163 Row[info=[ts=1571267799123760] ]: clustering2 | [number=789 ts=1571267799123760], [string=&&&&& ts=1571267799123760]
[2]@211 Row[info=[ts=1571267799118192] ]: clustering1 | [number=123 ts=1571267799118192], [string=I am another partition ts=1571267799118192]
[4]@259 deletedAt=1571275831259830, localDeletion=1571275831 #partition_tombstone
[3]@278 Row[info=[ts=1571267799121652] ]: clustering1 | [number=456 ts=1571267799121652], [string=<tombstone> ts=1571276377357675 ldt=1571276377] #cell_tombstone

Where are we now?

SELECT * FROM tombstones.test;
partitionkey clusteringcol number string
1 clustering4 0 to be, or ! to be.
1 clustering3 789 what to write?
1 clustering2 456 some other text
2 clustering2 789 &&&&&
2 clustering1 123 I am another partition
3 clustering1 456

So now let’s take a look at some other ways we could end up with a cell tombstone. First, let’s INSERT a new row and call it partition 5

INSERT INTO tombstones.test (partitionkey, clusteringcol, string, number) VALUES (5,'clustering1','I have been inserted',9000);

Here's the sstable in json

[
  {
    "partition" : {
      "key" : [ "5" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 62,
        "clustering" : [ "clustering1" ],
        "liveness_info" : { "tstamp" : "2019-10-17T01:53:19.088984Z" },
        "cells" : [
          { "name" : "number", "value" : 9000 },
          { "name" : "string", "value" : "I have been inserted" }
        ]
      }
    ]
  }
]

And again the whole table after flush and compaction

[5]@0 Row[info=[ts=1571277199088984] ]: clustering1 | [number=9000 ts=1571277199088984], [string=I have been inserted ts=1571277199088984]
[1]@67 Row[info=[ts=1571267799116790] ]: clustering4 | [number=0 ts=1571267799116790], [string=to be, || ! to be. ts=1571267799116790]
[1]@128 Row[info=[ts=1571267799115484] ]: clustering3 | [number=789 ts=1571267799115484], [string=what to write? ts=1571267799115484]
[1]@167 Row[info=[ts=1571267799114163] ]: clustering2 | [number=456 ts=1571267799114163], [string=some other text ts=1571267799114163]
[1]@206 Row[info=[ts=-9223372036854775808] del=deletedAt=1571268145951117, localDeletion=1571268145 ]: clustering1 | #row_tombstone
[2]@230 Row[info=[ts=1571267799123760] ]: clustering2 | [number=789 ts=1571267799123760], [string=&&&&& ts=1571267799123760]
[2]@278 Row[info=[ts=1571267799118192] ]: clustering1 | [number=123 ts=1571267799118192], [string=I am another partition ts=1571267799118192]
[4]@326 deletedAt=1571275831259830, localDeletion=1571275831 #partition_tombstone
[3]@345 Row[info=[ts=1571267799121652] ]: clustering1 | [number=456 ts=1571267799121652], [string=<tombstone> ts=1571276377357675 ldt=1571276377] #cell_tombstone

What about if I want to UPDATE one of the values in the row I just INSERT'd? We have to be very specific:

UPDATE tombstones.test SET string = 'I have been upserted' WHERE partitionkey = 5 AND clusteringcol = 'clustering1';

sstable

[
  {
    "partition" : {
      "key" : [ "5" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 58,
        "clustering" : [ "clustering1" ],
        "cells" : [
          { "name" : "string", "value" : "I have been upserted", "tstamp" : "2019-10-17T01:57:41.012018Z" }
        ]
      }
    ]
  }
]

Note that the only value that was written in this sstable was "I have been upserted". The number column value is absent. But when we SELECT:

SELECT * FROM tombstones.test WHERE partitionkey = 5;
partitionkey clusteringcol number string
5 clustering1 9000 I have been upserted

...the new string value is returned with the old number value! Again, after compaction

[5]@0 Row[info=[ts=1571277199088984] ]: clustering1 | [number=9000 ts=1571277199088984], [string=I have been upserted ts=1571277461012018] #merged back together!
[1]@72 Row[info=[ts=1571267799116790] ]: clustering4 | [number=0 ts=1571267799116790], [string=to be, || ! to be. ts=1571267799116790]
[1]@133 Row[info=[ts=1571267799115484] ]: clustering3 | [number=789 ts=1571267799115484], [string=what to write? ts=1571267799115484]
[1]@172 Row[info=[ts=1571267799114163] ]: clustering2 | [number=456 ts=1571267799114163], [string=some other text ts=1571267799114163]
[1]@211 Row[info=[ts=-9223372036854775808] del=deletedAt=1571268145951117, localDeletion=1571268145 ]: clustering1 |
[2]@235 Row[info=[ts=1571267799123760] ]: clustering2 | [number=789 ts=1571267799123760], [string=&&&&& ts=1571267799123760]
[2]@283 Row[info=[ts=1571267799118192] ]: clustering1 | [number=123 ts=1571267799118192], [string=I am another partition ts=1571267799118192]
[4]@331 deletedAt=1571275831259830, localDeletion=1571275831
[3]@350 Row[info=[ts=1571267799121652] ]: clustering1 | [number=456 ts=1571267799121652], [string=<tombstone> ts=1571276377357675 ldt=1571276377]

The same works for INSERT when we leave a column unset. Let's use string in this case:

INSERT INTO tombstones.test (partitionkey, clusteringcol, number) VALUES (5,'clustering1',8888);

sstable

[
  {
    "partition" : {
      "key" : [ "5" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 41,
        "clustering" : [ "clustering1" ],
        "liveness_info" : { "tstamp" : "2019-10-17T02:03:47.755945Z" },
        "cells" : [
          { "name" : "number", "value" : 8888 }
        ]
      }
    ]
  }
]

Similarly to the UPDATE, the only value that was written to the sstable was 8888. The string column value is absent this time. SELECT before compaction:

SELECT * FROM tombstones.test WHERE partitionkey = 5;
partitionkey clusteringcol number string
5 clustering1 8888 I have been upserted

Now the string column is the same, but the number is the new value!

[5]@0 Row[info=[ts=1571277827755945] ]: clustering1 | [number=8888 ts=1571277827755945], [string=I have been upserted ts=1571277461012018]
[1]@72 Row[info=[ts=1571267799116790] ]: clustering4 | [number=0 ts=1571267799116790], [string=to be, || ! to be. ts=1571267799116790]
[1]@133 Row[info=[ts=1571267799115484] ]: clustering3 | [number=789 ts=1571267799115484], [string=what to write? ts=1571267799115484]
[1]@172 Row[info=[ts=1571267799114163] ]: clustering2 | [number=456 ts=1571267799114163], [string=some other text ts=1571267799114163]
[1]@211 Row[info=[ts=-9223372036854775808] del=deletedAt=1571268145951117, localDeletion=1571268145 ]: clustering1 |
[2]@235 Row[info=[ts=1571267799123760] ]: clustering2 | [number=789 ts=1571267799123760], [string=&&&&& ts=1571267799123760]
[2]@283 Row[info=[ts=1571267799118192] ]: clustering1 | [number=123 ts=1571267799118192], [string=I am another partition ts=1571267799118192]
[4]@331 deletedAt=1571275831259830, localDeletion=1571275831
[3]@350 Row[info=[ts=1571267799121652] ]: clustering1 | [number=456 ts=1571267799121652], [string=<tombstone> ts=1571276377357675 ldt=1571276377]

Let's say I explicitly want to remove the number value, and at the same time change the string using an INSERT:

INSERT INTO tombstones.test (partitionkey, clusteringcol, string) VALUES (5,'clustering1','did the number get deleted?');

sstabledump

[
  {
    "partition" : {
      "key" : [ "5" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 65,
        "clustering" : [ "clustering1" ],
        "liveness_info" : { "tstamp" : "2019-10-17T02:10:03.024349Z" },
        "cells" : [
          { "name" : "string", "value" : "did the number get deleted?" }
        ]
      }
    ]
  }
]

I don't see the value for number anywhere after a flush...

SELECT * FROM tombstones.test WHERE partitionkey = 5; //uh oh
partitionkey clusteringcol number string
5 clustering1 8888 did the number get deleted?

Our previous number value is still there! How can we ensure that we're DELETE on the number value, UPDATE'ing the string, all while using an INSERT statement? (have we all gone mad?!?!?)

Cell Tombstone with null

INSERT INTO tombstones.test (partitionkey, clusteringcol, string, number) VALUES (5,'clustering1','now the number is gone...',null); //this is a cell tombstone

sstable after nodetool flush

[
  {
    "partition" : {
      "key" : [ "5" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 64,
        "clustering" : [ "clustering1" ],
        "liveness_info" : { "tstamp" : "2019-10-17T02:12:27.380989Z" },
        "cells" : [
          { "name" : "number", "deletion_info" : { "local_delete_time" : "2019-10-17T02:12:27Z" }
          },
          { "name" : "string", "value" : "now the number is gone..." }
        ]
      }
    ]
  }
]

sstable after compaction:

[5]@0 Row[info=[ts=1571278347380989] ]: clustering1 | [number=<tombstone> ts=1571278347380989 ldt=1571278347], [string=now the number is gone... ts=1571278347380989] #cell_tombstone
[1]@70 Row[info=[ts=1571267799116790] ]: clustering4 | [number=0 ts=1571267799116790], [string=to be, || ! to be. ts=1571267799116790]
[1]@131 Row[info=[ts=1571267799115484] ]: clustering3 | [number=789 ts=1571267799115484], [string=what to write? ts=1571267799115484]
[1]@170 Row[info=[ts=1571267799114163] ]: clustering2 | [number=456 ts=1571267799114163], [string=some other text ts=1571267799114163]
[1]@209 Row[info=[ts=-9223372036854775808] del=deletedAt=1571268145951117, localDeletion=1571268145 ]: clustering1 | #row_tombstone
[2]@233 Row[info=[ts=1571267799123760] ]: clustering2 | [number=789 ts=1571267799123760], [string=&&&&& ts=1571267799123760]
[2]@281 Row[info=[ts=1571267799118192] ]: clustering1 | [number=123 ts=1571267799118192], [string=I am another partition ts=1571267799118192]
[4]@329 deletedAt=1571275831259830, localDeletion=1571275831 #partition_tombstone
[3]@348 Row[info=[ts=1571267799121652] ]: clustering1 | [number=456 ts=1571267799121652], [string=<tombstone> ts=1571276377357675 ldt=1571276377] #cell_tombstone
SELECT * FROM tombstones.test WHERE partitionkey = 5; //now it's really gone
partitionkey clusteringcol number string
5 clustering1 now the number is gone...

Range Tombstone

The last type of tombstone that we’ll look at is the range tombstone. This can be very useful when deleting multiple rows from within the same partition. Instead of deleting each row individually, you will have one tombstone at the upper bound and another at the lower. Let’s watch...

CREATE TABLE IF NOT EXISTS tombstones.range (
  partitionkey int,
  clusteringcol int,
  string text,
  PRIMARY KEY (partitionkey, clusteringcol)
) WITH CLUSTERING ORDER BY (clusteringcol DESC);

INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (1,1,'first');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (1,2,'second');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (1,3,'third');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (1,4,'fourth');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (1,5,'fifth');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (1,6,'sixth');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (1,7,'seventh');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (1,8,'eighth');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (1,9,'ninth');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (1,10,'tenth');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (2,1,'first');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (2,2,'second');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (2,3,'third');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (2,4,'fourth');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (2,5,'fifth');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (2,6,'sixth');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (2,7,'seventh');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (2,8,'eighth');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (2,9,'ninth');
INSERT INTO tombstones.range (partitionkey,clusteringcol,string) VALUES (2,10,'tenth');

sstable

[1]@0 Row[info=[ts=1571278763051320] ]: 10 | [string=tenth ts=1571278763051320]
[1]@35 Row[info=[ts=1571278763050489] ]: 9 | [string=ninth ts=1571278763050489]
[1]@52 Row[info=[ts=1571278763049568] ]: 8 | [string=eighth ts=1571278763049568]
[1]@70 Row[info=[ts=1571278763048579] ]: 7 | [string=seventh ts=1571278763048579]
[1]@89 Row[info=[ts=1571278763047440] ]: 6 | [string=sixth ts=1571278763047440]
[1]@106 Row[info=[ts=1571278763046122] ]: 5 | [string=fifth ts=1571278763046122]
[1]@123 Row[info=[ts=1571278763044830] ]: 4 | [string=fourth ts=1571278763044830]
[1]@141 Row[info=[ts=1571278763043442] ]: 3 | [string=third ts=1571278763043442]
[1]@158 Row[info=[ts=1571278763042178] ]: 2 | [string=second ts=1571278763042178]
[1]@176 Row[info=[ts=1571278763040463] ]: 1 | [string=first ts=1571278763040463]
[2]@193 Row[info=[ts=1571278763061688] ]: 10 | [string=tenth ts=1571278763061688]
[2]@229 Row[info=[ts=1571278763060188] ]: 9 | [string=ninth ts=1571278763060188]
[2]@247 Row[info=[ts=1571278763058940] ]: 8 | [string=eighth ts=1571278763058940]
[2]@266 Row[info=[ts=1571278763058001] ]: 7 | [string=seventh ts=1571278763058001]
[2]@286 Row[info=[ts=1571278763057052] ]: 6 | [string=sixth ts=1571278763057052]
[2]@304 Row[info=[ts=1571278763056174] ]: 5 | [string=fifth ts=1571278763056174]
[2]@321 Row[info=[ts=1571278763055216] ]: 4 | [string=fourth ts=1571278763055216]
[2]@339 Row[info=[ts=1571278763054278] ]: 3 | [string=third ts=1571278763054278]
[2]@356 Row[info=[ts=1571278763053360] ]: 2 | [string=second ts=1571278763053360]
[2]@374 Row[info=[ts=1571278763052304] ]: 1 | [string=first ts=1571278763052304]

The syntax for a range DELETE specifies the PARTITION KEY, and uses conditional operators for the CLUSTERING COLUMN(s)

DELETE FROM tombstones.range WHERE partitionkey = 1 AND clusteringcol > 3 AND clusteringcol < 8;

sstable

[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "range_tombstone_bound",
        "start" : {
          "type" : "exclusive",
          "clustering" : [ 8 ],
          "deletion_info" : { "marked_deleted" : "2019-10-17T02:24:18.898466Z", "local_delete_time" : "2019-10-17T02:24:18Z" }
        }
      },
      {
        "type" : "range_tombstone_bound",
        "end" : {
          "type" : "exclusive",
          "clustering" : [ 3 ],
          "deletion_info" : { "marked_deleted" : "2019-10-17T02:24:18.898466Z", "local_delete_time" : "2019-10-17T02:24:18Z" }
        }
      }
    ]
  }
]
SELECT * FROM tombstones.range WHERE partitionkey = 1;
partitionkey clusteringcol string
1 10 tenth
1 9 ninth
1 8 eighth
1 3 third
1 2 second
1 1 first

After compaction... Viola!

[1]@0 Row[info=[ts=1571278763051320] ]: 10 | [string=tenth ts=1571278763051320]
[1]@35 Row[info=[ts=1571278763050489] ]: 9 | [string=ninth ts=1571278763050489]
[1]@52 Row[info=[ts=1571278763049568] ]: 8 | [string=eighth ts=1571278763049568]
[1]@70 Marker EXCL_START_BOUND(8)@1571279058898466/1571279058
[1]@87 Marker EXCL_END_BOUND(3)@1571279058898466/1571279058
[1]@104 Row[info=[ts=1571278763043442] ]: 3 | [string=third ts=1571278763043442]
[1]@121 Row[info=[ts=1571278763042178] ]: 2 | [string=second ts=1571278763042178]
[1]@139 Row[info=[ts=1571278763040463] ]: 1 | [string=first ts=1571278763040463]
[2]@156 Row[info=[ts=1571278763061688] ]: 10 | [string=tenth ts=1571278763061688]
[2]@192 Row[info=[ts=1571278763060188] ]: 9 | [string=ninth ts=1571278763060188]
[2]@210 Row[info=[ts=1571278763058940] ]: 8 | [string=eighth ts=1571278763058940]
[2]@229 Row[info=[ts=1571278763058001] ]: 7 | [string=seventh ts=1571278763058001]
[2]@249 Row[info=[ts=1571278763057052] ]: 6 | [string=sixth ts=1571278763057052]
[2]@267 Row[info=[ts=1571278763056174] ]: 5 | [string=fifth ts=1571278763056174]
[2]@284 Row[info=[ts=1571278763055216] ]: 4 | [string=fourth ts=1571278763055216]
[2]@302 Row[info=[ts=1571278763054278] ]: 3 | [string=third ts=1571278763054278]
[2]@319 Row[info=[ts=1571278763053360] ]: 2 | [string=second ts=1571278763053360]
[2]@337 Row[info=[ts=1571278763052304] ]: 1 | [string=first ts=1571278763052304]

Partition Tombstone of Range Delete

How about just one more partition tombstone for the road? Here's the full json output of sstabledump of our range example's partition 2 before the DELETE:

{
    "partition" : {
      "key" : [ "2" ],
      "position" : 156
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 192,
        "clustering" : [ 10 ],
        "liveness_info" : { "tstamp" : "2019-10-17T02:19:23.061688Z" },
        "cells" : [
          { "name" : "string", "value" : "tenth" }
        ]
      },
      {
        "type" : "row",
        "position" : 192,
        "clustering" : [ 9 ],
        "liveness_info" : { "tstamp" : "2019-10-17T02:19:23.060188Z" },
        "cells" : [
          { "name" : "string", "value" : "ninth" }
        ]
      },
      {
        "type" : "row",
        "position" : 210,
        "clustering" : [ 8 ],
        "liveness_info" : { "tstamp" : "2019-10-17T02:19:23.058940Z" },
        "cells" : [
          { "name" : "string", "value" : "eighth" }
        ]
      },
      {
        "type" : "row",
        "position" : 229,
        "clustering" : [ 7 ],
        "liveness_info" : { "tstamp" : "2019-10-17T02:19:23.058001Z" },
        "cells" : [
          { "name" : "string", "value" : "seventh" }
        ]
      },
      {
        "type" : "row",
        "position" : 249,
        "clustering" : [ 6 ],
        "liveness_info" : { "tstamp" : "2019-10-17T02:19:23.057052Z" },
        "cells" : [
          { "name" : "string", "value" : "sixth" }
        ]
      },
      {
        "type" : "row",
        "position" : 267,
        "clustering" : [ 5 ],
        "liveness_info" : { "tstamp" : "2019-10-17T02:19:23.056174Z" },
        "cells" : [
          { "name" : "string", "value" : "fifth" }
        ]
      },
      {
        "type" : "row",
        "position" : 284,
        "clustering" : [ 4 ],
        "liveness_info" : { "tstamp" : "2019-10-17T02:19:23.055216Z" },
        "cells" : [
          { "name" : "string", "value" : "fourth" }
        ]
      },
      {
        "type" : "row",
        "position" : 302,
        "clustering" : [ 3 ],
        "liveness_info" : { "tstamp" : "2019-10-17T02:19:23.054278Z" },
        "cells" : [
          { "name" : "string", "value" : "third" }
        ]
      },
      {
        "type" : "row",
        "position" : 319,
        "clustering" : [ 2 ],
        "liveness_info" : { "tstamp" : "2019-10-17T02:19:23.053360Z" },
        "cells" : [
          { "name" : "string", "value" : "second" }
        ]
      },
      {
        "type" : "row",
        "position" : 337,
        "clustering" : [ 1 ],
        "liveness_info" : { "tstamp" : "2019-10-17T02:19:23.052304Z" },
        "cells" : [
          { "name" : "string", "value" : "first" }
        ]
      }
    ]
  }

buh bye...

DELETE FROM tombstones.range WHERE partitionkey = 2; //this is a partition tombstone, just for fun :-)

sstabledump after:

[
  {
    "partition" : {
      "key" : [ "2" ],
      "position" : 0,
      "deletion_info" : { "marked_deleted" : "2019-10-17T02:27:46.045375Z", "local_delete_time" : "2019-10-17T02:27:46Z" }
    },
    "rows" : [ ]
  }
]

Somehow that's very satisfying. If you've made it this far down my soliloquy, thanks! Hopefully it's been helpful. E-mail me @ james.colvin@datastax.com if you have any questions or feedback.

success

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment