@simong
Last active December 30, 2015 06:29

Cassandra data model

Publications

Contains the publication metadata needed to render publication lists.

| publicationId | displayName | type | date | thumbnailUri | publisher | linkedContentId |
|---|---|---|---|---|---|---|
| p:abc123 | Origin of Species | book | 123456789 | remote:http//... | Cambridge | null |
| p:def456 | SSTables in practice | article | 12345789 | null | NoSQL in the real world | c:cam:kljs2341 |
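
As a rough illustration of the row shape above, a publication could be modelled in Python as follows. The class name, the snake_case field names, and the assumption that `date` is an integer timestamp are mine, not part of the data model; the nullable columns are the ones shown as null in the example rows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Publication:
    """One row of the Publications CF (hypothetical in-memory representation)."""
    publication_id: str
    display_name: str
    type: str
    date: int  # assumed to be an integer timestamp; the unit is not specified
    publisher: str
    thumbnail_uri: Optional[str] = None      # null in the second example row
    linked_content_id: Optional[str] = None  # null in the first example row


# The second example row from the table.
article = Publication(
    publication_id="p:def456",
    display_name="SSTables in practice",
    type="article",
    date=12345789,
    publisher="NoSQL in the real world",
    linked_content_id="c:cam:kljs2341",
)
```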

PublicationsBySource

Maps a source ID to a publication ID. A source ID is a composite key made up of the source and the publication's external identifier in that source. Sources are citation indexes such as 'Web of Science', 'PubMed', 'Mendeley', etc. This way, when we later need to support sources other than Symplectic, we can do some very basic disambiguation based on the external resource id.

| sourceId | publicationId |
|---|---|
| wos:123 | p:abc123 |
| mendeley:abc342 | p:def456 |
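
A minimal sketch of how such composite source IDs could be built and split, assuming the `source:externalId` shape shown in the table. The helper names are hypothetical, and splitting on the first colon only (so that an external identifier may itself contain colons) is an assumption rather than something the model specifies.

```python
def make_source_id(source: str, external_id: str) -> str:
    """Build a composite key such as 'wos:123' (hypothetical helper)."""
    if not source or ":" in source:
        raise ValueError("source must be non-empty and must not contain ':'")
    return f"{source}:{external_id}"


def parse_source_id(source_id: str) -> tuple[str, str]:
    """Split on the first ':' only, so external ids may contain colons."""
    source, sep, external_id = source_id.partition(":")
    if not sep:
        raise ValueError(f"not a composite source id: {source_id!r}")
    return source, external_id


# In-memory stand-in for the PublicationsBySource column family.
publications_by_source = {
    make_source_id("wos", "123"): "p:abc123",
    make_source_id("mendeley", "abc342"): "p:def456",
}
```

Looking up `publications_by_source["wos:123"]` then gives the publication ID for a record that is already known under its Web of Science identifier.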

PublicationsByUser

Maps a user id to a set of publications, so that we can generate a view listing all the publications for a user. In the example below, simon and bert co-authored p:s13m13.

| userId | p:abc123 | p:s13m13 |
|---|---|---|
| u:cam:simong | 1 | 1 |
| u:cam:bert | | 1 |

PublicationsAuthors

Maps a publication ID to a set of OAE user ids, so that we can get all the co-authors of a publication. This is essentially the inverse of the PublicationsByUser CF.

| publicationId | u:cam:simong | u:cam:bert |
|---|---|---|
| p:abc123 | 1 | |
| p:s13m13 | 1 | 1 |
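
Because PublicationsByUser and PublicationsAuthors are inverses of each other, every authorship has to be written to both. The sketch below models the two column families as plain in-memory dictionaries (row key to a dict of column names) purely for illustration; `link_author` is a hypothetical helper, and a real implementation would issue two Cassandra writes instead.

```python
from collections import defaultdict

# In-memory stand-ins for the two column families: row key -> {column name: value}.
publications_by_user = defaultdict(dict)
publications_authors = defaultdict(dict)


def link_author(user_id: str, publication_id: str) -> None:
    """Record one authorship in both directions (hypothetical helper)."""
    publications_by_user[user_id][publication_id] = 1
    publications_authors[publication_id][user_id] = 1


# The example data from the two tables.
link_author("u:cam:simong", "p:abc123")
link_author("u:cam:simong", "p:s13m13")
link_author("u:cam:bert", "p:s13m13")

# All publications for a user, and all co-authors of a publication.
simon_publications = sorted(publications_by_user["u:cam:simong"])
co_authors = sorted(publications_authors["p:s13m13"])
```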

SymplecticNewUsers

Tracks new OAE users whose publications still need to be fetched from Symplectic. A row is removed once the user has been processed during an incremental update.

| userId |
|---|
| u:cam:nico |

Update cycles

Initial full update

  1. Get the list of users in Symplectic (/users)
  2. For each Symplectic user, check whether they are in OAE
     1. Each Symplectic user has a username=<val> attribute, which maps to an external id from an authentication service
     2. Construct the loginId: tenantAlias + ':' + authenticationStrategy + ':' + usernameVal
     3. Check AuthenticationLoginId for a matching OAE user. If one exists, go to step 3; otherwise move on to the next Symplectic user
  3. Get all the publications for that user
     1. GET /users/username-usernameVal/publications?details=full
     2. Ingest (get/create) all those publications
     3. Add the OAE user in PublicationsByUser for each ingested publication
  4. Store the date lastRun=Date.now() somewhere (yet another CF?)
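
The full update cycle above could be sketched roughly as follows. Everything here is an in-memory stand-in: the Symplectic API is replaced by a list of users and a `fetch_publications` callable, the column families are plain dictionaries, and the default tenant alias and authentication strategy (`"cam"`, `"shibboleth"`) are illustrative assumptions, as are all function names.

```python
import time


def build_login_id(tenant_alias: str, strategy: str, username: str) -> str:
    """tenantAlias + ':' + authenticationStrategy + ':' + usernameVal."""
    return f"{tenant_alias}:{strategy}:{username}"


def full_update(symplectic_users, fetch_publications, login_ids,
                publications, publications_by_user,
                tenant_alias="cam", strategy="shibboleth"):
    """Run one full update cycle and return the new lastRun timestamp.

    symplectic_users   -- list of dicts with a 'username' key (stand-in for /users)
    fetch_publications -- callable(username) -> list of publication dicts
    login_ids          -- dict loginId -> OAE userId (stand-in for AuthenticationLoginId)
    """
    for s_user in symplectic_users:
        login_id = build_login_id(tenant_alias, strategy, s_user["username"])
        user_id = login_ids.get(login_id)
        if user_id is None:
            continue  # not an OAE user, move on to the next Symplectic user
        for pub in fetch_publications(s_user["username"]):
            pub_id = pub["publicationId"]
            # get/create: keep an existing row, otherwise store the new one
            publications.setdefault(pub_id, pub)
            publications_by_user.setdefault(user_id, {})[pub_id] = 1
    return int(time.time() * 1000)  # milliseconds since the epoch


# Example run with made-up usernames; only 'sg555' maps to an OAE user.
login_ids = {"cam:shibboleth:sg555": "u:cam:simong"}
publications, by_user = {}, {}
last_run = full_update(
    [{"username": "sg555"}, {"username": "unknown"}],
    lambda username: [{"publicationId": "p:abc123",
                       "displayName": "Origin of Species"}],
    login_ids, publications, by_user,
)
```

Passing the data sources in as arguments keeps the loop testable without a Cassandra cluster or a Symplectic endpoint.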

Incremental updates

Two things need to happen:

  1. Symplectic updates their index
     1. Get all the new/updated users that Symplectic added to their index since lastRun
     2. Ingest these in the same way as in a full update cycle
  2. New users in OAE
     1. For each new user in SymplecticNewUsers (full CF scan unfortunately)
        1. Get the loginId from AuthenticationUserLoginId and derive the externalId
        2. Get publications for that externalId: GET /users/username-usernameVal/publications?details=full
        3. Ingest (get/create) all those publications
        4. Add the (OAE user, publication) pairs in PublicationsByUser and PublicationsAuthors
        5. Remove the userId from the SymplecticNewUsers CF

Afterwards, store the date lastRun=Date.now() somewhere (yet another CF?)
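
The "new users in OAE" half of the incremental update might look like the sketch below. As before, the column families are plain Python containers and `fetch_publications` stands in for the Symplectic call. Deriving the external id as the last segment of the loginId is an assumption based on how the loginId is constructed, and the function name is hypothetical.

```python
def ingest_new_oae_users(new_users, user_login_ids, fetch_publications,
                         publications, publications_by_user,
                         publications_authors):
    """Process every userId in SymplecticNewUsers and remove it afterwards.

    new_users      -- set of OAE userIds (stand-in for SymplecticNewUsers)
    user_login_ids -- dict userId -> loginId (stand-in for AuthenticationUserLoginId)
    """
    for user_id in list(new_users):  # iterate over a copy: we remove as we go
        login_id = user_login_ids.get(user_id)
        if login_id is None:
            continue  # no loginId yet, leave the user for a later run
        # Assumption: the external id is the last segment of the loginId.
        external_id = login_id.rsplit(":", 1)[-1]
        for pub in fetch_publications(external_id):
            pub_id = pub["publicationId"]
            publications.setdefault(pub_id, pub)  # get/create
            publications_by_user.setdefault(user_id, {})[pub_id] = 1
            publications_authors.setdefault(pub_id, {})[user_id] = 1
        new_users.discard(user_id)


# Example run with a made-up loginId for the user from the SymplecticNewUsers table.
new_users = {"u:cam:nico"}
user_login_ids = {"u:cam:nico": "cam:shibboleth:nm417"}
publications, by_user, authors = {}, {}, {}
ingest_new_oae_users(
    new_users, user_login_ids,
    lambda external_id: [{"publicationId": "p:def456"}],
    publications, by_user, authors,
)
```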