Skip to content

Instantly share code, notes, and snippets.

@seanh seanh/usernames.md Secret
Last active Jul 27, 2016

Embed
What would you like to do?
Review of usernames and userids in h

Review of usernames and userids in h

Terms

  • "username" means the short version, "Sean.H"
  • "uid" means the normalized version of the username, e.g. "seanh" (if you didn't use any .'s or upper-case characters in your username then your uid will be the same as your username)
  • A user's "id" is their auto incrementing integer primary key in the db, e.g. 3324
  • "userid" means the full version of the username including the domain part, "acct:someone@hypothes.is"

The problem

A third-party account provider sends us a username and we need to create an account for that username (if one doesn't already exist) and return a token that can be used to authenticate requests as that account.

So we now need to be able to have acct:someone@somewhere_else user accounts in our db and have those accounts work with our client, website, etc as well as having acct:someone@hypothes.is accounts at the same time.

In fact we need to be able to have two accounts with the same username, as long as the @domain suffix is different. (We also need two accounts with different domains to be able to have the same email address.)

Some parts of the Hypothesis code use fully-qualified userids like acct:someone@hypothes.is, which is good. Other parts of the code use just the username part, someone, and it's implied that that means someone@hypothes.is, in most places that isn't good and won't work when we actually have third-party accounts.

Also we can't control what the third-party account provider sends us as much as we can control what accounts our own user's register using our own form. For example we can't restrict what characters are allowed in the username, or the length, etc. And we may not have a password or email address for the account.

Suggested solution

There are more fine-grained list of potential tasks below but in broad strokes my suggestion is:

  1. Add the authority part as a new column on the user table

  2. Change username unique constraint to (username, authority)

  3. Change email unique constraint to (email, authority), or we may have to remove it entirely (at least for non-hypothes.is accounts) because third-party account providers may allow different accounts to have the same email

  4. Remove lots of restrictions on user accounts that are currently implemented in the database or model code (that they must have a password, that they must have an email address, length limits on passwords and usernames, allowed characters in usernames...) or make them apply to @hypothes.is users only

  5. Make the code use fully qualified acct:someone@hypothes.is userids everywhere except at the edges.

    When a user logs in or registers using our forms we're not going to make them type someone@hypothes.is, they just type someone, but we change that to acct:someone@hypothes.is asap before sending it elsewhere in the code.

    Similarly when a third-party account provider sends us an authorization grant it's probably going to have a "seanh" username and not one of the form "acct:seanh@third-party.com", but we transform it into that asap.

    In other places we do want to use the full userids as input, for example when looking up or nipsaing users on the admin pages we don't want to be restricted to only hypothes.is users.

    On the way out, in many places when we're displaying usernames we probably just display "seanh" but this denormalization should happen as late as possible (and probably not at all in some outputs, like in API responses).

  6. We need to allow and deal with @ characters in usernames, which are a special problem

Suggested tasks

This is an attempt at a list of Trello card / pull request-sized tasks that need to get done.

Some of these we may not actually need to do for our first round of third-party account providers (for example, if we know that their usernames don't have @'s in them then we don't need to deal with that).

Some of these we may not need to do for third-party accounts v1 (e.g. if it's not going support session promotion yet),

Some of these are probably just wrong.

And there's definitely some missing!

But hopefully this list will be some kind of useful start:

  • Add authority column to user table, fill existing and new rows with "hypothes.is".

  • Enable creating new users with an authority other than "hypothes.is". Of course it won't actually be possible to do this using our register form. But there should be code that can be called from a test that can do it.

  • Enable creating pre-activated users. When we automatically create an @third-party.com user account we don't want the user to then have to activate the account (and we won't necessarily have an email address for the user, so we might not be able to send an activation email)

  • Change username unique constraint to (username, authority), change uid unique constraint to (uid, authority), fix User.get_by_username() to work when usernames and uids are no longer unique (see below).

  • Change email unique constraint to (email, authority), and change User.get_by_email() to work when emails are no longer unique.

  • Allow usernames shorter than USERNAME_MIN_LENGTH, at least if authority is not "hypothes.is"

  • Allow usernames longer than USERNAME_MAX_LENGTH, at least if authority is not "hypothes.is"

  • Allow usernames that don't match USERNAME_PATTERN, at least if authority is not "hypothes.is"

  • Allow accounts with usernames that are currently considered too similar / that normalize to the same uid, even if those accounts have the same authority (at least if the authority is not "hypothes.is")

  • Allow user accounts with no password, at least when the authority is not "hypothes.is" and make sure these accounts cannot be logged-in to.

  • Allow user accounts with no email address, at least when the authority is not "hypothes.is". One thing to watch out for is that sending of reply notifications, account activation emails, password reset emails doesn't crash but instead do something sensible.

  • Enable using email addresses as usernames / enable usernames with @ characters in them (see notes below)

  • Enable adding non-hypothes.is users to feature cohorts

  • Enable NIPSAing non-hypothes.is users

  • Enable looking up non-hypothes.is users on /admin/users (for example to change their username, delete them, activate their account)

  • Enable non-hypothes.is userids to be authenticated when making HTTP requests to hypothes.is web pages. accounts.get_user() currently prevents this. It's probably easiest to leave this one until we've decided how "session promotion" will work, since until we have that you won't be able to make a request to a hypothes.is page and authenticate as a non-hypothes.is account anyway (e.g. it's not possible to login to one of these accounts in the normal way).


Below are detailed notes on different parts of the code...

URLs

I didn't attempt to write any tasks to do with URLs since I'm not sure what we want to do here but here's some discussion.

A lot of URLs contain usernames of the "seanh" form, some of these also correspond to usernames typed into search forms:

  • stream?q=user:seanh
  • /search?q=user%3Aseanh
  • /users/seanh/activity (potential activity page URL, doesn't exist yet)

Do we need to change all these URLs to include the domain part? /users/hypothes.is/seanh/activity or /users/someone@hypothes.is/activity?

Or do we continue leave out the domain part and assume @hypothes.is, and then when we want to support third-party accounts with these pages we namespace the entire URL by account? /plos.org/search, /plos.org/users/seanh/activity etc. Or subdomains: plos.hypothes.is/search, plos.hypothes.is/users/seanh/activity

The user table in the db

The user table has:

  1. username, e.g. seanh (this is what the user typed in when they registered) (unique, not nullable)
  2. id, eg. 33363 (this is the auto-incrementing integer primary key)
  3. uid, e.g. seanh, this is a normalised version of the username (remove .'s and lowercase) (unique, not null) This is to prevent people from having usernames that are different but look too similar to the eye.
  4. email (unique, not null)
  5. password (not null)

Notably this database table does not contain the domain part (hypothes.is) of a userid.

The normalization of usernames is a problem. We can't help it if a third-party accounts provider has usernames that our normalization thinks are too similar. We can either just remove this code, or we can somehow make it apply to @hypothes.is usernames only.

The unique constraints on username and uid will have to be weakened, to be unique constraints on (username, authority) and (uid, authority).

We restrict usernames to being longer than USERNAME_MIN_LENGTH and shorter than USERNAME_MAX_LENGTH. Since we can't control third-party usernames this restriction will have to be removed, or made to apply to hypothes.is accounts only, maybe moved from models to views or schemas code.

We restrict usernames to containing only letters, numbers, periods and underscores. We'll have to either remove this restriction or make it apply to hypothes.is accounts only. Maybe move from models to views or schemas code.

The unique constraint on email will have to become (email, authority) as well, or be removed entirely at least for non-hypothes.is accounts, since a third-party account provider may have different accounts with the same email.

We restrict emails to being EMAIL_MAX_LENGTH (100 characters) long. Again this restriction will have to be removed, or changed to apply to hypothes.is accounts only. Maybe moved from models to views or schemas code.

We also have a blacklist of usernames that we don't allow, but this is enforced in the schemas code not the models code, so it shouldn't be a problem (the blacklist can continue to apply to users who register new @hypothes.is user accounts but won't apply to usernames that we receive from third-party identity providers.

The @ character in usernames

Lots of websites use email addresses not usernames to login.

For example I login to foo.com using my email address seanh@mydomain.cc and a password.
I do not have any unique username on foo.com other than my email address.
Essentially seanh@mydomain.cc is my username on foo.com.

Just taking the seanh from seanh@mydomain.cc won't be unique, as there could also be a foo.com account with the email address seanh@other.com.

If foo.com were to embed Hypothesis and wanted users logged-in to foo.com to have Hypothesis accounts automatically created and logged-in to, then in the "authorization grant" JWT that foo.com sends to hypothes.is the "sub" field ("subject", the user that foo.com is asking us to create and log in) would have to be seanh@mydomain.cc.

And of course we could not strip that to seanh@foo.com because again that wouldn't be unique.

It seems like we would have to interpret the seanh@mydomain.cc foo.com account to a Hypothesis userid acct:seanh@mydomain.cc@foo.com.

In other words the username part of the userid has an @ in it.

Not only do we not currently allow @ characters in usernames but when taking the full userid and splitting it into username and domain parts we split on the first @, so from the userid acct:seanh@mydomain.cc@foo.com you would get:

username: seanh domain: mydomain.cc@foo.com

Which of course is wrong.

Suggested solution:

  • Drop the assumption that @ will not appear in usernames, and the restriction that @ isn't allowed in usernames
  • Adopt the assumption that @ will not appear in domains
  • When splitting a fully qualified userid into username and domain parts, split on the last @ not the first one

Retrieving users from the db

User.get_by_email() lowercases the given email address and returns the first match. This won't work when emails are no longer unique. Add authority arg to this method?

User.get_by_username() normalizes the given username (producing a uid) and returns the first user with this uid. This won't work when usernames and uid's are no longer unique. Add authority arg to this method? Or replace this with get_by_userid()?

account.get_user()

accounts.get_user() is a standalone function in accounts/__init__.py which seems inconsistent with the functions above with are User classmethods. (Also inconsistent naming style.) Anyway, it takes a full userid (acct:someone@hypothes.is), splits it into username and domain parts, returns None if the domain part is not request.auth_domain, otherwise calls User.get_by_username() with the username part.

What to make of this? Is the returning None behaviour actually relied on anywhere? This doesn't really seem like a sensible behaviour to have.

Events

The accounts code sends a bunch of events that other parts of the code subscribe to: ActivationEvent, LoginEvent, LogoutEvent, RegistrationEvent, PasswordResetEvent. These are sent with actual User objects, not username or userid strings, which since the user model doesn't include the domain means that the domain / authority part is not known to the event listener. But once we add an authority column to the user table it will be.

Admin Pages

  • Feature cohorts doesn't support adding non-hypothes.is users
  • NIPSA doesn't support nipsa'ing non-hypothes.is users
  • You can't add a non-hypothes.is user as an admin, though this may not be a problem
  • You can't add a non-hypothes.is user as staff, though this may not be a problem
  • You can't lookup non-hypothes.is users on /admin/users (for example to change their username, delete them, activate their account)

Groups

  • Each group has a unique, automatically generated "pubid", e.g. "b6k49n1Z", which is the actual unique identifier of the group. The group's name "Sean's cool group" or its normalized URL form "seans-cool-group" is not unique. So if we one day let third-party users create groups, this should work just fine (e.g. a hypothes.is user and a plos.org user could each create a group with the same name, they would get different pubids).

  • If we were to let a third-party.com user join a group that also contained hypothes.is users or vice-versa, then we might have a problem. Or at least it wouldn't make any sense. We may have to partition groups into hypothes.is groups and plos.org groups for this reason. (We could use the userid of the group creator to partition on.)

  • Fortunately group creator_id's in the database are user table ids (the autoincrementing integer pkeys) so, once we've added a domain/authority column to the user table, they can unambiguously be resolved to a fully qualified userid, which means we can tell whether a group was created by a third-party.com or a hypothes.is user.

  • We may also have to consider namespacing groups when we implement public groups. For example if we're showing a list of public groups that have annotated this document in the sidebar, would we want to show hypothes.is groups to plos.org users? Or only show them plos.org groups? And vice-versa. While we only have private groups this isn't an issue as long as we don't let users from different domains join each other's groups.

Document equivalence

Sean wonders whether in the future, when admins can control document equivalence in some way (e.g. by manually adding a URI to a document), third-party.com and other-third-party.com might disagree on whether two URIs represent the same document and each want to do it differently. Perhaps manually created URI equivalence rules will have to be able to be party-specific.

Things that seem to be okay as they are

  • Each annotation (both in the database, model code and search index) has a userid, these are full acct:MeganDreffs@hypothes.is form

  • request.authenticated_userid is already a full userid

  • The long-lived developer API tokens that we currently have already use a full userid

  • The JWT's that the client currently sometimes uses to authenticate to the web service already use full userids

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.