Hey Will and Others,
Here's a brief summary of what's going on in different parts of DataHub:
All calls to the database are routed through what I'm calling the "command pipeline." Calls are made to either manager.py or rlsmanager.py (that's the row level security manager), which instantiate connection.py. connection.py calls pg.py, which is where the postgres-specific SQL commands are stored.
A datahub application -> manager.py/rlsmanager.py -> connection.py -> pg.py
DataHub is architected this way so that it can support other databases with minimal effort. For BT, this means writing a new file, oracle.py, and pointing connection.py at it instead.
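Roughly, the layering looks like the sketch below. This is only an illustration, not DataHub's actual code: the class names (Manager, Connection, PostgresBackend, OracleBackend), the BACKENDS registry, and the list_repos_sql method are all made up for the example, and the backends just return SQL strings instead of talking to a database.

```python
# Hypothetical sketch of the "command pipeline". None of these names are
# taken from the DataHub codebase.


class PostgresBackend:
    """Stands in for pg.py: the only place postgres-specific SQL lives."""

    def list_repos_sql(self):
        # In postgres, a repo is a schema.
        return "SELECT schema_name FROM information_schema.schemata"


class OracleBackend:
    """Stands in for a future oracle.py exposing the same method names."""

    def list_repos_sql(self):
        # Illustrative only: in Oracle, schemas are tied to users.
        return "SELECT username FROM all_users"


BACKENDS = {"postgres": PostgresBackend, "oracle": OracleBackend}


class Connection:
    """Stands in for connection.py: picks a backend and delegates to it."""

    def __init__(self, backend="postgres"):
        self.backend = BACKENDS[backend]()

    def list_repos_sql(self):
        return self.backend.list_repos_sql()


class Manager:
    """Stands in for manager.py: the layer the application calls."""

    def __init__(self, backend="postgres"):
        self.connection = Connection(backend)

    def list_repos_sql(self):
        return self.connection.list_repos_sql()


print(Manager().list_repos_sql())
print(Manager("oracle").list_repos_sql())
```

The point of the design is that Manager never contains database-specific SQL, so supporting a second database only touches the backend file and the one place where Connection chooses it.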
- manager.py
- rlsmanager.py [note: this is in an unmerged branch]
- connection.py
- pg.py
Users in DataHub are granted postgres roles and a database.
In the DataHub codebase, it's helpful to define some terms:
- repo == postgres schema
- repo_base == postgres database (which is owned by a user of the same name)
- web application username == postgres role
User b grants access to user a on schema s. User a connects to user b's repo_base, and only schema s is visible.
The GRANT command happens in postgres, and is also stored in a web application table, so that the web app knows to make schema s visible to user a.
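Here's a small sketch of that two-part bookkeeping. It's an illustration under assumptions: the function names, the in-memory collaborators list (standing in for the web application table), and the exact privileges granted (USAGE on the schema plus SELECT on its tables) are my guesses for the example, not what DataHub necessarily runs.

```python
# Hypothetical sketch; function and variable names are illustrative only.


def quote_ident(name):
    """Double-quote a postgres identifier, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def grant_statements(schema, grantee):
    """SQL that would be run inside user b's database (their repo_base)."""
    s, g = quote_ident(schema), quote_ident(grantee)
    return [
        f"GRANT USAGE ON SCHEMA {s} TO {g};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {s} TO {g};",
    ]


# Stand-in for the web application table that remembers who shared what.
collaborators = []


def share_repo(repo_base, repo, grantee):
    """Build the GRANTs and record the share so the web app can show it."""
    statements = grant_statements(repo, grantee)
    collaborators.append({"repo_base": repo_base, "repo": repo, "user": grantee})
    return statements


def visible_repos(repo_base, username):
    """Repos in repo_base that the web app should list for username."""
    return [
        c["repo"]
        for c in collaborators
        if c["repo_base"] == repo_base and c["user"] == username
    ]


for statement in share_repo("b", "s", "a"):
    print(statement)
print(visible_repos("b", "a"))
```

Postgres enforces the actual access; the application-side record only exists so the web app knows which of user b's schemas to display to user a.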
In BT's model, it sounds like all users get their own schemas but work in one large shared database. Some DH permission checks are done in manager.py and verify that repo_base == username, so integrating with BT would necessitate a partial rewrite of manager.py.
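To make the difference concrete, here's a rough sketch of the two kinds of check. Both functions are hypothetical, as is the schema_owners mapping; the real checks in manager.py may look quite different, and the BT-side rule is only my guess at what a shared-database model would need.

```python
# Hypothetical sketch; not the actual checks in manager.py.


def can_manage_datahub(username, repo_base):
    """Current DataHub model: each user owns the database named after them."""
    return repo_base == username


def can_manage_shared(username, repo, schema_owners):
    """Possible shared-database model: ownership is tracked per schema.

    schema_owners is an assumed mapping of schema name -> owning username.
    """
    return schema_owners.get(repo) == username


owners = {"alice_data": "alice"}
print(can_manage_datahub("alice", "alice"))
print(can_manage_shared("alice", "alice_data", owners))
print(can_manage_shared("bob", "alice_data", owners))
```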
There are four models for exposing data to all users/unauthenticated users:
- Users "publish" their entire repo to the internet. In this case, unauthenticated users can select from this repo's data through DataHub.
- Users "publish" cards to the internet. In this case, a user authorizes unauthenticated users to execute a pre-defined query. This does not support passing parameters.
- Users allow the "ALL" user to operate on rows in their table through row level security (not merged, but I believe that this supports unauthenticated users).
- Users grant repo access to another specified user (see above).
In the first case, the user is actually granting SELECT access to a special dh_public role in postgres. All users in DataHub are part of the dh_public role. Unauthenticated users are logged on using a dh_anonymous role, which is also part of the dh_public role, and so they also have access to repos that dh_public can select from.
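A toy model of that role setup, just to show why publishing to dh_public reaches anonymous visitors too. Only the role names dh_public and dh_anonymous come from the description above; the dictionaries, the example users and repos, and the can_select function are invented for illustration and simplify how postgres actually resolves privileges.

```python
# Hypothetical model of the role setup; only dh_public and dh_anonymous
# are real role names from the description.

# role -> roles it is a member of
MEMBER_OF = {
    "alice": {"dh_public"},
    "bob": {"dh_public"},
    "dh_anonymous": {"dh_public"},
}

# (repo_base, repo) -> roles that have been granted SELECT
SELECT_GRANTS = {
    ("alice", "published_repo"): {"dh_public"},
    ("alice", "private_repo"): set(),
}


def can_select(role, repo_base, repo):
    """True if role owns the database, or if the role itself or a role
    it belongs to has been granted SELECT on the repo."""
    if role == repo_base:
        return True
    granted = SELECT_GRANTS.get((repo_base, repo), set())
    effective = {role} | MEMBER_OF.get(role, set())
    return bool(effective & granted)


print(can_select("dh_anonymous", "alice", "published_repo"))
print(can_select("dh_anonymous", "alice", "private_repo"))
```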
Aside from the UX, there's nothing preventing users from also granting INSERT and UPDATE to dh_public. This would allow unauthenticated people to manipulate their data, much like Google Docs.