Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save amolpujari/8950439 to your computer and use it in GitHub Desktop.
Save amolpujari/8950439 to your computer and use it in GitHub Desktop.
MongoDB Data Modeling and Implementations Notes

MongoDb Data Modeling Notes - v1.0

This is a rough document getting prepared for db design and architect level decisions to be made while dealing with area where one wants to consider MongoDB abd web services.

The tools would mainly include mongoDb, RSTful resources, background jobs, and possible queuing engine(RabbitMQ) to provide unit inputs to the background jobs. After measurements caching layers like ElasticSearch can be introduced as require. MySQL can be considered in few cases to support non-relational data storage as per need.

RESTful services and background jobs can be implemented using best frameworks available in Java. Or now a days extreme-service-developers are talking about wonders of Go language, hence would be great if Go language is considered for writing service layer. This is one very nice article about the REST services in Go

Conventions

Conventions make life easier, hence we would also have many conventions as below

RESTful URLs

All data must get sent over to device(app client) using REST only way and by following RESTful routes. All the REST service must follow the below routing conventions.

example resource:

/users/        method="GET"     # :controller => 'users', :action => 'index'
/users/:id     method="GET"     # :controller => 'users', :action => 'show'
/users         method="POST"    # :controller => 'users', :action => 'create'
/users/signUp  method="POST"    # :controller => 'users', :action => 'signUp'
/users/:id     method="PUT"     # :controller => 'users', :action => 'update'
/users/:id     method="DELETE"  # :controller => 'users', :action => 'destroy'

HTTP methods

  • Any action that would change the existing attribute(s) value must be a PUT method
  • Any action that would add or create a new instance of any resource must be a POST method
  • Any action that would delete one or more resource(s) must be a DELETE method
  • Any action that would not change state of any object but simply accesses the resource(s) or their one or more attributes must be a GET method
  • Only GET request will be made to get any data from the service, no non-GET call is made to fetch any data from the service
  • ALL non-GET requests must return 200/201 or error cases
  • HEAD can be used to check the status of any resource, it can be used to return 300

Resource(s)'s Behaviors

A resource or resources could have behaviors like show, count, process, perform. These are different from resource attributes. We could consider the following mapping

"resources"    like name of the table 
"resource"     a single instance or one row of the table
"attribute"    one column of the table
"behavior"     one operation that could happen one a resource or a multiple

Behaviors can be classified into two ways

  • Class behavior

    /resources/staticBehavior

Static behaviors are bound to multiple instances of any resource or they are like class level behaviors, and cannot be applied to any instance as such

/users/count
  • Instance behavior

    /resources/:resourceId/instanceBehavior

these are applied to individual resource, example

/users/102/block

Note that "resource(s)" are nouns, and "behaviors" are verbs.

Kindly read more about RESTful resource here

Resource Naming

A resource name would appear in routes, it would appear in database, it would appear on application UI. Lets follow following conventions

"resource" is a singular entity represent one instance of resource-class

"resources" is a plural represent multiple instances of resource-class, OR represent resource-class itself

e.g. user, and users topicCategory, topicCategories

Finally It is important to name resource as they appear on the screen to end user

Web Service Layer

All Service API can be categorized as

  • CRUD API

    Supporting all operations per resource. A CRUD service must focus only on CRUD operations and should not provide any other API.

  • Screen based API

    Supporting data require per screen. This is the collection or grouping of API per screen. These are the APIs that are defined as per the need or requirement. For instance, the contacts screen would need the following APIs

    • GET /users/:userId/contacts
    • GET /users/:userId/contacts/appUsers
    • POST /users/:userId/follow/:userIdToFollow
    • POST /users/:userId/unfollow/:userIdToUnfollow
    • POST /users/:userId/contacts
  • READ or WRITE API

    Segregating read and write APIs, GET and non-GET methods

    WRITE API must be monitored if they are taking longer. Such API can trigger background job for data processing they do.

    READ API taking longer can be identified for caching opportunities

Screen Data Collections

These are few example data set require to be fetched through REST services. Here attempt is made to identify what data is needed per screen and then calling one or more API calls to get the require data. These would help us define APIs apart from CRUD operations. Following are the screens for examples

  1. SPLASH

    GET /api/v1/assets/forSplashScreen

This would return a list of assets for splash screen. asset is one of the resource type. Please assume that all the service calls could be started with /api/v1/.

  1. SingUp

    POST /users/singUp

This would post various user attributes to create a new user. Service would first validate all the attributes, mainly the uniqueness and existing user and would respond accordingly.

  1. Setup Mood

    GET /moods

    POST /users/765765/setupMood

  2. Setup Topic

    GET /topicCategories

  3. User Contacts

    GET /user/6576576/contacts

    This would get all the user contacts

    GET /user/6576576/contacts/appUsers

    This would get user contacts who are using app

Persistence

Most of the data can be stored in mongoDb only. The only reason to store data in MySQL in few cases might be to use one or more relational-dbms features. mongoDb would indeed scale better than MySQL, and its a better choice among other NoSQL db likes couchDb, Cassandra.

MongoDb Collections

MongoDb is

  • document based NoSQL db, its a non-relational form of storing data
  • which is highly scalable, hence for most of the queries it would perform better than expected
  • it allows querying over numbers, strings, range
  • it supports indexes

The data modeling with mongoDb is mainly dependent on the application usage. For instance, user data. There are several attributes of user. as listed below including firstName, password and more. Even though we want each of our service call to respond faster, there are several service calls and few of them will hit often, and too often. Like user password verification call will not be made often, however user info(firstName, lastName, contacts etc) will get accessed more often. The user attributes that get accessed and/or queried often can be clubbed together in a single collection.

users attributes

- userId(_id)  ................................
- firstName                                   .
- lastName                                    .
- email                                       .
- photoUrl                                    .
- description                                 .
- isBlocked                                   .
- isDeleted                                   ...........  users collection
- contryCode                                  .
- phoneNumber                                 .
- handle       ................................

- userId    .....................   
- deviceId                      .
- phoneModel                    ....... userDevices
- vendor                        .
- osVersion                     .
- isInUse   .....................

- userId(_id)    ........
- lastSeenFromIP        ......................... userLastSeen collection
- geoLocation           .
- lastSeenInAt  .........

- userId             ........................   
- confirmationToken                         .
- confirmedAt                               .
- confirmationSentAt                        .
- encryptedPassword                         ..... userAccounts collection
- salt                                      .
- resetPasswordToken                        .
- resetPasswordSentAt .......................

All of the above user attributes can be divided into mongoDb collections. As shown above users and userAccounts. userAccounts attributes are required only at the time of signup, signin, forgot password, account confirmation, wherein users attributes are required most of the time by the application.

Again we may further divide users attributes into users and userProfiles as, it might not need the user's description all the time, this would reduce data transfer. userProfiles will be queried only when user sees or edits her own profile.

So important notes here are

  • Club resource attributes together in a single mongoDb doc, that are queried together or are require together for any screen or any process

  • Minimize data transfer as much as possible between service and the device, this is otherwise the thumb-rule at any level

Ideally user attributes can be cached by the application until it is marked by dirty or by using expiry tag. SDWebImage can be used to manage device cache by application.

The collection userLastSeen is a Most Write Collection, hence it require no index and it can be put in a different database, as write operations would put database level lock.

Following is the authentications collection, it is used to store user data when user connects or registers using oauth based options like Facebook/Google/Twitter.

authentications

  • userId
  • provider ENUM FB/G/T
  • uid
  • data
  • authError
  • lastSyncedContactsAt

On oauth based registrations screens, it needs to query many collections including users, userAccounts and authentications.

contacts

  • userId
  • contactId
  • isFollowing

Application needs to show one contacts screen where all users' contacts who are using app need to be listed with options to (un)follow them.

GET /users/982173/contacts/appUsers

In this case, the API would be finally return the following

Output: a list of contacts(app users) with their mini-profile details including name, photo_url etc. Here it needs to make mainly following queries

  1. One to get contacts where userId is 982173
  2. filter the contacts for users who lastSeen for last 6 months 2.1. (optional) filter contacts who have synced their contacts using one of F/B/T during last 6 months
  3. Get mini-profile attributes from users collection for filtered contacts

Again, it can be decided not to follow above queries for a period of say 24 hours( configurable) and use the cached result. In this case it needs to have one more collection as

cachedAppContacts

  • userId(_id)
  • userContacts {embedded contacts list ready to be sent}

Any time this above table will be checked first, if not exists then follow above queries, write to cache(with TTL=24 hrs), this cache will get used for next 24 hours and mongoDb will auto-erase it after 24 hours. A separate database can be created for such cacheResults collections. It has been observed that, in such cases mongoDb could perform better than any caching engine like memcached or ElasticSearch. This can be verified by measurements.

Next, the Same screen could fire (un)follow requests as

POST /users/65761219/follow/78979821

This request will create/update a record in contacts collection as

{
  userId: 65761219,
  contactId: 78979821,
  isFollowing: true
}

POST /users/65761219/unfollow/78979821

This request will create/update a record in contacts collection as

{
  userId: 65761219,
  contactId: 78979821,
  isFollowing: false
}

The following POST call would initially populate the contacts collections

POST /users/7867861/contacts => payload: a list of contact numbers

This call would perform the following activities

  1. Find or create users by contact numbers(countryCode+phoneNumber)
  2. Find or create contacts for all the record found in query 1 above for the currentUser

If observed that this processing taking longer, then a background job can be defined for the same, thus the API c quickly return back 200 status. Ideally it should not take that long with mongoDb. However if this call get a great hit-count then it would a good practice to turn it into a background-jobs.

MongoDb Best Practices

Following are few important best practices one must consider while working with mongoDb.

  • Once a mongoDb collection is defined then it is better to have fix schema for it, that means all the documents in the collections are having same size, that means avoid having NULL values, have default for them. Or ensure that the collections that are getting queried often have very less scope for NULL value attributes. Such collections are fixed-size collections, and that support very high throughput for READ/WRITE operations. Such collections are called as Capped Collections. More about them here. Capped Collections are the first choice where recently inserted data gets read often.

  • When it comes to scaling, and it needs to compromise on one of data-duplication and data-transfer, then it should ignore data-duplication and ensure less data gets transfer between client and server

  • Define indexes over collections with high-read+low-write, do not have indexes on collections with high-write+low-read. Please read more here

  • Avoid full document update and update only require fields, make maximum use of mongoDb operators and modifiers, example for increment a counter field, use $inc

  • User short names for fields, they do eat memory

  • If a field is stored as Float, do not store an Integer value, always store it in Float form. Any similar changes to the document schema would impact the document space and would result in low performance

  • Avoid changes to the existing schema collection, if require define a new collection with fixed schema and migrate the data

  • Make use _id field, its a reserved primary key column, it gets its default unique value unless specified. It can be set ensuring its constraints are met

  • Use multiple databases. Database level locking lets you split up workloads across databases to avoid contention e.g. you could separate high throughput logging from an authentication database. Identify MostRead, AvgRead, MostWrite collections and keep them in separate databases.

Caching Opportunities

  • List all GET requests ordered by hit-count, avg-response-time by nginx access log always monitor top GET call with max hit-count and avg-response-time on weekly basis set email notification for such GET call that would notify once a week this would require to scan nginx access log weekly

  • Caching could be required to done per user

  • Correctly identify when to erase cache

  • Any data collection(data being served by any API), if taking longer, we should move the data from MySQL to mongoDb, or then from mongoDb to ElasticSearch, or it could get served as static json file directly from nginx

  • Identify Most Read-only MySQL table from the beginning, knowing these would allow us to cache their SELECT queries efficiently using MySQL cache

  • GET requests with most hits counts must have their data in mongoDb where they it could get queried based on parameters, at such times serving it from MySQL would require more time compared to mongoDb

  • Some times data can be duplicated between MySQL and mongoDb, however the mongoDb data would remain query-ready, that means all data attributes will be stored and processed using MySQL and once processed, the selected attributes will be copied to mongoDb

Background Jobs

This section defines background jobs, their behaviors and best practices to be followed around.

Example background jobs

  1. Update user social data
  2. Update user device data
  3. Update user's relevance index
  4. Process Activity

Background Jobs: Best Practices

  • A job must be idempotent in nature
  • A job must do only the required operation and should not do anything else, examples
    • managing job queue for itself
    • introducing any cache or dealing with data persistence
  • A job should have clear ways of accepting inputs, examples
    • arguments
      • file based
      • database based
    • scheduled time
    • should have clear ways of providing outputs, examples
      • execution status
      • errors if any
      • output
      • total number to attempts to be made in case of errors

General Coding Practices

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment