gfiorav/cartodb_table_management.md

## cartodb_table_management.md

      
    Raw
  

              cartodb_table_management.md
            
          
    Keeping editor tables accurate

In recent weeks I've been tasked with tracking down some issues in CartoDB related to some duplicate tables appearing in some users' dashboards. It was a race condition in a process we call "ghost table linkage", that helps the editor keep up with changes in the user's db.
The issue was fixed implementing a redis lock following the RedLock alghorithm. You can find our implementation (the horrendously named Carto::Bolt) here.
I decided to keep on and try to refactor the whole user db-editor sync a bit. That's what this piece is about: How the CartoDB editor tracks physical table changes in the user's database.
Some background

Throughout this text, I will mention the CartoDB editor and the user database. In order to make sure we're on the same page or incase you're new to the CartoDB architecture, let's run through the basics.
The user's database

CartoDB provides each user with a database where tables can be created, altered, dropped, etc. It's pretty much a standard PostgreSQL database that can be "directly" accessed through the CartoDB SQL API.
The editor workflow

Whenever you use the editor to create, modify or delete a dataset, it those actions into SQL queries that will be applied on that dataset's table. Each dataset is really just a table in the user's database. Well, not exactly: to be manipulated by the editor, the table must first be "cartodbyfied" (which essentially translates to adding some index and special columns).
From now on, we will assume every table in the user's database is cartodbyfied, as we have no business with tables that are not.
How we keep stuff synced

Let's start with some naming conventions:

Physical table: A table in the user's database
User table: A representation of a physical table for the editor (i.e. a dataset)
Table id or oid: A unique id associated to the physical table.

So, I've mentioned that every editor operation over a dataset (or user table) translates to SQL procedure on the corresponding physical table. That's great: If the editor starts the action, it's easy to keep track of what's going on.
However, we also talked about the SQL API and how we could use it to alter physical tables directly without using the editor at all. The question is now inevitable:
"What happens to a user table when it's corresponding physical table is altered or dropped through the SQL API without the editor knowing about it?"
Enter the world of ghost tables.
Ghost tables

The term "ghost table" refers to a user table who's physical table has changed or disappeared resulting in it being "unlinked" from real table. The editor checks for changes between what it knows (user tables) and the reality (physical tables). Such process is known as "ghost table linkage".
This process is launched every time a user accesses the dashboard view. It can run synchronously or in background depending on the type of ghost tables linkage needed.
When a user table will be shown but no physical table corresponds to it and clicking on it will produce errors, we run synchrounsly. While running synchronously, the UI is blocked until the ghost table linkage has finalised.
When the only changes have been the addition of new physical tables, there is no danger of clicking on any stale user table as it hasn't been added yet. In this case the user is free to navigate the UI while ghost tables linkage runs in the background. It might take a moments and a refresh to get the new tables to appear in the editor (remember they must be cartodbfied).
Identifying all scenarios

The goal is for the editor to figure out what the SQL API has done with the physical tables since the last time it checked, so let's see what could have happened:

A physical table might have been dropped (we need to remove it's user table)
A physical table might have been renamed (we need to update it's user table)
A physical table might have been created (we need to add a user table for it)
The SQL API hasn't done anything

The cases are very simple. In reality however, it get's tricky to identify each case. One possible approach is looking at the table ids. They're unique after all right? Let's try:
Id matching

For a given table (user table or physical) with id i:
Take these statements:

UT^id : There's a user table with id i
PT^id : There's a physical table with id i

and these events:

new table: The given table has been created through the SQL API and cartodbfied
renamed: The given table has been renamed through the SQL API
dropped: The given table has been dropped through the SQL API
untouched: Nothing has happened

The truth table is:


UT^id
PT^id


new table
F
T


renamed table
T
T


dropped table
T
F


untouched table
T
T


Oh, no! How do we differentiate between a renamed table and an untouched table? In both cases we can find user and physical tables with matching ids (remember renaming a table will not alter its id). We need more information. Let's try searching both in physical and user tables by id and name:
Joint id and name matching

For a given table (user table or physical) with id i and name n:
Take these statements:

UT^id,n : There's a user table with id i and name n
PT^id,n : There's a physical table with id i and name n

and these events:

new table: The given table has been created through the SQL API and cartodbfied
renamed: The given table has been renamed through the SQL API
dropped: The given table has been dropped through the SQL API
untouched: Nothing has happened

Now the truth table is:


UT^id,n
PT^id,n


new table
F
T


renamed table
F
T


dropped table
T
F


untouched table
T
T


We're in trouble again! Now we can't distinguish new tables from renamed tables. For both cases it's true that we can't find a user table with matching id and name to any physical table.
So, do we have to check for id and name matches separately for user and physical tables? Not really. We can go by mixing for physical tables, as it's were the question is originated (the sql api changed the table and name and id are in sync for it) but we do need to separate name from id for user tables. Basically, we must OR the id, name relation for user tables, but we can AND them for physical tables. Let's try:
Separate name and id matching

For a give table (user table or physical) with id i and name n:
Take these statements:

UT^id : There's a user table with id i
UTⁿ : There's a user table with name n
PT^id,n : There's a physical table with id i and name n

and these events:

new table: The given table has been created through the SQL API and cartodbfied
renamed: The given table has been renamed through the SQL API
dropped: The given table has been dropped through the SQL API
untouched: Nothing has happened

We can write out a truth table where we can identify each scenario according to the statements.


UT^id
UTⁿ
PT^id,n


new table
F
F
T


renamed table
T
F
T


dropped table
T
T
F


untouched table
T
T
T


Let's write each case out:

A new table was created by the SQL API if there's a physical table for which no user table matches neither in id or name.
A table was renamed by the SQL API if there's a physical table for which a user table matches in id, but not in name.
A table was dropped by the SQL API if there's a user table for which no physical table matches in id or name.
The SQL API hasn't done anything if there is no physical table for which there isn't a user table with matching id and name.

So now you know how the editor knows about SQL API activity by comparing user tables to physical tables!
	UT^id	UTⁿ	PT^id,n
new table	F	F	T
renamed table	T	F	T
dropped table	T	T	F
untouched table	T	T	T