yashbonde/fs_on_pg.md

## fs_on_pg.md

      
    Raw
  

              fs_on_pg.md
            
          
    File System UX on Postgres

What are the properties of file system UX and how we implemented it in postgres.
Before we begin, here's the table SQL
CREATE TABLE documents (
	id text NOT NULL,
	collection_id text NULL,
	created_at timestamptz NULL,
	updated_at timestamptz NULL,
	deleted_at timestamptz NULL,
	url text NULL,
	"type" text NULL,
	size_bytes int8 NULL,
	"name" text NULL,
	"path" ltree NULL,
	status text NULL DEFAULT 'pending'::text,
	status_text text NULL,
	nbx_bucket bool NULL DEFAULT false,
	size_k_b int8 NULL,
	metadata jsonb NULL,
	CONSTRAINT documents_pkey PRIMARY KEY (id),
	CONSTRAINT fk_documents_collection FOREIGN KEY (collection_id) REFERENCES public.collections(id)
);
Remember:

file systems are glorified key:value pairs
we give meaning to key by spltting around /

Uniqueness

Have you ever seen two files with the same path? Even the idea feels weird, that is because all files are unique. However in our flat tables we can have any number of same name because our primary is id. So we created a unique index on (collection_id, name) but what happens when a soft delete happens and we try to create a file with the same name. In this case the logic will fail, so we have to add deleted_at as a column in the unique command:
ALTER TABLE public.documents
ADD CONSTRAINT
  uq_documents_deleted_collection_name unique NULLS NOT DISTINCT (deleted_at, collection_id, name);
See the big NULLS NOT DISTINCT, that is because before Postgres 15 nulls were treated differently, read stackoverflow. So we upgraded out Postgres to v16 and fixed this.
Now you can keep on deleting and keep creating same file again and again.
Heirarchy

File system use the simple / split key value. Filesystems show folders and files both really fast, a simple ls -la shows you this:
-rw-r--r--   1 yashbonde  staff    480 Sep 20 17:22 .env
-rw-r--r--   1 yashbonde  staff    190 Sep 19 19:08 README.md
drwxr-xr-x   5 yashbonde  staff    160 Sep 18 12:30 armoury
where the d at the start tells that it is a directory. This can then be brought to life in a UI that contains folders and files as icons. This means that a simple list-table kinda structure would need to be modified for filepaths.
Assume we have the file structure like this:
├── a
│   ├── b.pdf
│   └── c.pdf
├── d.pdf
├── e.txt
└── f
│   ├── g.txt
│   └── h
│       └── i.pdf

One method is to use LIKE operator, read stackoverflow. But it solves only one part of the problem which is the start. Imagine by some sorting order these are the files in the table:
a/b.pdf
e.txt
d.pdf
a/c.pdf
f/h/i.pdf
f/g.pdf

If you do an LIKE with a/ as prefix you get [a/b.pdf, a/c.pdf]
What if I ask you what are all the folders here the answer should be [a, f] and if I ask at prefix=f then should be [h]. This is a very useful addition to the file system UI. Doing this with LIKE is almost impossible to get right.
Solution

We use an extension called ltree, read pg doc. This allows us to store data in a herarchical format so we can do complex queries like an actual prefix match. We need to create a new index for this:
CREATE EXTENSION IF NOT EXISTS ltree;
ADD CONSTRAINT uq_documents_deleted_collection_name unique NULLS NOT DISTINCT (deleted_at, collection_id, name);
create index tree_path_idx on public.documents using gist ("path");

Without going too much in implementation:

we add a concept of root folder _r that is added as the fist element in the path
we get speed ups over doing LIKE where all the strings would be matched in all the rows. There is potential for funny business with offset and limit where you might end up interating over the entire dataset to find matches. so you can get all the files inside a folder with query like:

select *
FROM public.documents d
where
	d.path = '_r.baz'
offset 0
limit 4;

folders became an absolute breeze with powerful queries which can group and aggregate complex information about folders which would have been impossible without ltree:

select
	replace(ltree2text(d.path), '_r.', '') as folder_name,
	COUNT(*) as file_count,
	SUM(d.size_bytes) as indexed_size,
	SUM(d.hit_count) as hits,
FROM public.documents d
where
	d.path <@ '_r' and
	nlevel(d.path) = 2
group by d.path;
Result

A single API:

With these you should be able to build all the possible reasonable file system UIs:

icons view
list view
gallery: More information with creative covers can be shown
columns (nested-list view): different list are shown for different folders, depth is towards the right (depth in armoury is locked to 4 folders. So UI can be built with that assumption)