Skip to content

Instantly share code, notes, and snippets.

@RogerTangos
Last active May 16, 2016 15:44
Show Gist options
  • Save RogerTangos/8bbe792bb81c480d8d30ca10c230b7ce to your computer and use it in GitHub Desktop.
Save RogerTangos/8bbe792bb81c480d8d30ca10c230b7ce to your computer and use it in GitHub Desktop.
DataHub BT Writeup

Hey Will and Others,

Here's a brief summary of what's going on in different parts of DataHub:

The command pipeline:

All calls to the database are routed through what I'm calling the "command pipeline."

Calls are made to either manager.py or rlsmanager.py (that's row level security manager), which instantiate connection.py.connection.py calls pg.py, which is where the postgres-specific sql commands are stored.

A datahub application -> manager.py/rlsmanager.py -> connection.py -> pg.py

DataHub is architected in this way, so that it can support other databases with minimal effort. For BT, this means writing a new file, oracle.py, and pointing connection.py at it, instead.

Account Management & Repo Collaboration

Users in DataHub are granted postgres roles and a database.

In the DataHub codebase, it's helpful to define some terms.

  • repo == postgres schema
  • repo_base == postgres database (which is owned by a user of the same name)
  • web application username == postgres role

User b grants access to user a on schema s. User a connects to user b's repo_base, and only schema s is visible.

The GRANT command happens in postgres, and is also stored in a web application table, so that the web app knows to make schema s visible to user a.

In BT's model, it sounds like users all get schemas, but are working in a large shared database. Some DH checks for permission are done in manager.py, and check to see that repo_base == username, so integrating with BT would necessitate a partial rewrite of manager.py

Public Repos and Unauthenticated Users

There are four models for exposing data to all users/unauthenticated users

  1. Users "publish" their entire repo to the internet. In this case, unauthenticated users can select from this repo data through DataHub.
  2. Users "publish" cards to the internet. In this case, a user authorizes unauthenticated users to execute a pre-defined query. This does not support passing parameters
  3. Users allow the "ALL" user to operate on rows in their table through row level security (not merged, but I believe that this supports unauthenticated users)
  4. Users grant repo access to another specified user (see above)

In the first case, the user is actually granting SELECT access to a special dh_public role in the postgres. All users in datahub are part of the dh_public role. Unauthenticated users are logged on using a dh_anonymous role, which is also part of the dh_public role, and so they also have access to repos that dh_public can select from.

Aside from the UX, there's nothing preventing users from also granting INSERT and UPDATE to dh_public. This would allow unauthenticated people to manipulate their data, much like google docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment