amolpujari/mongodb_data_modeling_and_implementations_notes.md

## mongodb_data_modeling_and_implementations_notes.md

      
    Raw
  

              mongodb_data_modeling_and_implementations_notes.md
            
          
    MongoDb Data Modeling Notes - v1.0

This is a rough document getting prepared for db design and architect level decisions to be made while dealing with area where one wants to consider MongoDB abd web services.
The tools would mainly include mongoDb, RSTful resources, background jobs, and possible queuing engine(RabbitMQ) to provide unit inputs to the background jobs. After measurements caching layers like ElasticSearch can be introduced as require. MySQL can be considered in few cases to support non-relational data storage as per need.
RESTful services and background jobs can be implemented using best frameworks available in Java. Or now a days extreme-service-developers are talking about wonders of Go language, hence would be great if Go language is considered for writing service layer. This is one very nice article about the REST services in Go
Conventions

Conventions make life easier, hence we would also have many conventions as below
RESTful URLs

All data must get sent over to device(app client) using REST only way and by following RESTful routes.
All the REST service must follow the below routing conventions.
example resource:
/users/        method="GET"     # :controller => 'users', :action => 'index'
/users/:id     method="GET"     # :controller => 'users', :action => 'show'
/users         method="POST"    # :controller => 'users', :action => 'create'
/users/signUp  method="POST"    # :controller => 'users', :action => 'signUp'
/users/:id     method="PUT"     # :controller => 'users', :action => 'update'
/users/:id     method="DELETE"  # :controller => 'users', :action => 'destroy'

HTTP methods

Any action that would change the existing attribute(s) value must be a PUT method
Any action that would add or create a new instance of any resource must be a POST method
Any action that would delete one or more resource(s) must be a DELETE method
Any action that would not change state of any object but simply accesses the resource(s) or their one or more attributes must be a GET method
Only GET request will be made to get any data from the service, no non-GET call is made to fetch any data from the service
ALL non-GET requests must return 200/201 or error cases
HEAD can be used to check the status of any resource, it can be used to return 300

Resource(s)'s Behaviors
A resource or resources could have behaviors like show, count, process, perform. These are different from resource attributes. We could consider the following mapping
"resources"    like name of the table 
"resource"     a single instance or one row of the table
"attribute"    one column of the table
"behavior"     one operation that could happen one a resource or a multiple

Behaviors can be classified into two ways


Class behavior
/resources/staticBehavior


Static behaviors are bound to multiple instances of any resource or they are like class level behaviors, and cannot be applied to any instance as such
/users/count


Instance behavior
/resources/:resourceId/instanceBehavior


these are applied to individual resource, example
/users/102/block

Note that "resource(s)" are nouns, and "behaviors" are verbs.
Kindly read more about RESTful resource here

http://microformats.org/wiki/rest/urls
http://stackoverflow.com/questions/207477/restful-url-design-for-search
http://stackoverflow.com/questions/1619152/how-to-create-rest-urls-without-verbs

Resource Naming
A resource name would appear in routes, it would appear in database, it would appear on application UI.
Lets follow following conventions
"resource"
is a singular entity
represent one instance of resource-class
"resources"
is a plural
represent multiple instances of resource-class, OR represent resource-class itself
e.g. user, and users
topicCategory, topicCategories
Finally It is important to name resource as they appear on the screen to end user
Web Service Layer

All Service API can be categorized as


CRUD API
Supporting all operations per resource. A CRUD service must focus only on CRUD operations and should not provide any other API.


Screen based API
Supporting data require per screen. This is the collection or grouping of API per screen. These are the APIs that are defined as per the need or requirement. For instance, the contacts screen would need the following APIs

GET   /users/:userId/contacts
GET   /users/:userId/contacts/appUsers
POST  /users/:userId/follow/:userIdToFollow
POST  /users/:userId/unfollow/:userIdToUnfollow
POST  /users/:userId/contacts


READ or WRITE API
Segregating read and write APIs, GET and non-GET methods
WRITE API must be monitored if they are taking longer. Such API can trigger background job for data processing they do.
READ API taking longer can be identified for caching opportunities


Screen Data Collections

These are few example data set require to be fetched through REST services. Here attempt is made to identify what data is needed per screen and then calling one or more API calls to get the require data. These would help us define APIs apart from CRUD operations. Following are the screens for examples


SPLASH
GET /api/v1/assets/forSplashScreen


This would return a list of assets for splash screen. asset is one of the resource type.
Please assume that all the service calls could be started with /api/v1/.


SingUp
POST /users/singUp


This would post various user attributes to create a new user. Service would first validate all the attributes, mainly the uniqueness and existing user and would respond accordingly.


Setup Mood
GET /moods
POST /users/765765/setupMood


Setup Topic
GET /topicCategories


User Contacts
GET /user/6576576/contacts
This would get all the user contacts
GET /user/6576576/contacts/appUsers
This would get user contacts who are using app


Persistence

Most of the data can be stored in mongoDb only. The only reason to store data in MySQL in few cases might be to use one or more relational-dbms features. mongoDb would indeed scale better than MySQL, and its a better choice among other NoSQL db likes couchDb, Cassandra.
MongoDb Collections

MongoDb is

document based NoSQL db, its a non-relational form of storing data
which is highly scalable, hence for most of the queries it would perform better than expected
it allows querying over numbers, strings, range
it supports indexes

The data modeling with mongoDb is mainly dependent on the application usage. For instance, user data. There are several attributes of user. as listed below including firstName, password and more. Even though we want each of our service call to respond faster, there are several service calls and few of them will hit often, and too often. Like user password verification call will not be made often, however user info(firstName, lastName, contacts etc) will get accessed more often. The user attributes that get accessed and/or queried often can be clubbed together in a single collection.
users attributes

- userId(_id)  ................................
- firstName                                   .
- lastName                                    .
- email                                       .
- photoUrl                                    .
- description                                 .
- isBlocked                                   .
- isDeleted                                   ...........  users collection
- contryCode                                  .
- phoneNumber                                 .
- handle       ................................

- userId    .....................   
- deviceId                      .
- phoneModel                    ....... userDevices
- vendor                        .
- osVersion                     .
- isInUse   .....................

- userId(_id)    ........
- lastSeenFromIP        ......................... userLastSeen collection
- geoLocation           .
- lastSeenInAt  .........

- userId             ........................   
- confirmationToken                         .
- confirmedAt                               .
- confirmationSentAt                        .
- encryptedPassword                         ..... userAccounts collection
- salt                                      .
- resetPasswordToken                        .
- resetPasswordSentAt .......................

All of the above user attributes can be divided into mongoDb collections. As shown above users and userAccounts. userAccounts attributes are required only at the time of signup, signin, forgot password, account confirmation, wherein users attributes are required most of the time by the application.
Again we may further divide users attributes into users and userProfiles as, it might not need the user's description all the time, this would reduce data transfer. userProfiles will be queried only when user sees or edits her own profile.
So important notes here are


Club resource attributes together in a single mongoDb doc, that are queried together or are require together for any screen or any process


Minimize data transfer as much as possible between service and the device, this is otherwise the thumb-rule at any level


Ideally user attributes can be cached by the application until it is marked by dirty or by using expiry tag. SDWebImage can be used to manage device cache by application.
The collection userLastSeen is a Most Write Collection, hence it require no index and it can be put in a different database, as write operations would put database level lock.
Following is the authentications collection, it is used to store user data when user connects or registers using oauth based options like Facebook/Google/Twitter.
authentications

userId
provider ENUM FB/G/T
uid
data
authError
lastSyncedContactsAt

On oauth based registrations screens, it needs to query many collections including users, userAccounts and authentications.
contacts

userId
contactId
isFollowing

Application needs to show one contacts screen where all users' contacts who are using app need to be listed with options to (un)follow them.
GET /users/982173/contacts/appUsers
In this case, the API would be finally return the following
Output: a list of contacts(app users) with their mini-profile details including name, photo_url etc. Here it needs to make mainly following queries

One to get contacts where userId is 982173
filter the contacts for users who lastSeen for last 6 months
2.1. (optional) filter contacts who have synced their contacts using one of F/B/T during last 6 months
Get mini-profile attributes from users collection for filtered contacts

Again, it can be decided not to follow above queries for a period of say 24 hours( configurable) and use the cached result. In this case it needs to have one more collection as
cachedAppContacts

userId(_id)
userContacts {embedded contacts list ready to be sent}

Any time this above table will be checked first, if not exists then follow above queries, write to cache(with TTL=24 hrs), this cache will get used for next 24 hours and mongoDb will auto-erase it after 24 hours. A separate database can be created for such cacheResults collections. It has been observed that, in such cases mongoDb could perform better than any caching engine like memcached or ElasticSearch. This can be verified by measurements.
Next, the Same screen could fire (un)follow requests as
POST /users/65761219/follow/78979821
This request will create/update a record in contacts collection as
{
  userId: 65761219,
  contactId: 78979821,
  isFollowing: true
}

POST /users/65761219/unfollow/78979821
This request will create/update a record in contacts collection as
{
  userId: 65761219,
  contactId: 78979821,
  isFollowing: false
}

The following POST call would initially populate the contacts collections
POST /users/7867861/contacts
=> payload: a list of contact numbers
This call would perform the following activities

Find or create users by contact numbers(countryCode+phoneNumber)
Find or create contacts for all the record found in query 1 above for the currentUser

If observed that this processing taking longer, then a background job can be defined for the same, thus the API c quickly return back 200 status. Ideally it should not take that long with mongoDb. However if this call get a great hit-count then it would a good practice to turn it into a background-jobs.
MongoDb Best Practices

Following are few important best practices one must consider while working with mongoDb.


Once a mongoDb collection is defined then it is better to have fix schema for it, that means all the documents in the collections are having same size, that means avoid having NULL values, have default for them. Or ensure that the collections that are getting queried often have very less scope for NULL value attributes. Such collections are fixed-size collections, and that support very high throughput for READ/WRITE operations. Such collections are called as Capped Collections. More about them here. Capped Collections are the first choice where recently inserted data gets read often.


When it comes to scaling, and it needs to compromise on one of data-duplication and data-transfer, then it should ignore data-duplication and ensure less data gets transfer between client and server


Define indexes over collections with high-read+low-write, do not have indexes on collections with high-write+low-read. Please read more here


Avoid full document update and update only require fields, make maximum use of mongoDb operators and modifiers, example for increment a counter field, use $inc


User short names for fields, they do eat memory


If a field is stored as Float, do not store an Integer value, always store it in Float form. Any similar changes to the document schema would impact the document space and would result in low performance


Avoid changes to the existing schema collection, if require define a new collection with fixed schema and migrate the data


Make use _id field, its a reserved primary key column, it gets its default unique value unless specified. It can be set ensuring its constraints are met


Use multiple databases. Database level locking lets you split up workloads across databases to avoid contention e.g. you could separate high throughput logging from an authentication database. Identify MostRead, AvgRead, MostWrite collections and keep them in separate databases.


Caching Opportunities


List all GET requests ordered by hit-count, avg-response-time by nginx access log
always monitor top GET call with max hit-count and avg-response-time on weekly basis
set email notification for such GET call that would notify once a week
this would require to scan nginx access log weekly


Caching could be required to done per user


Correctly identify when to erase cache


Any data collection(data being served by any API), if taking longer, we should move the data from MySQL to mongoDb, or then from mongoDb to ElasticSearch, or it could get served as static json file directly from nginx


Identify Most Read-only MySQL table from the beginning, knowing these would allow us to cache their SELECT queries efficiently using MySQL cache


GET requests with most hits counts must have their data in mongoDb where they it could get queried based on parameters, at such times serving it from MySQL would require more time compared to mongoDb


Some times data can be duplicated between MySQL and mongoDb, however the mongoDb data would remain query-ready, that means all data attributes will be stored and processed using MySQL and once processed, the selected attributes will be copied to mongoDb


Background Jobs

This section defines background jobs, their behaviors and best practices to be followed around.
Example background jobs

Update user social data
Update user device data
Update user's relevance index
Process Activity

Background Jobs: Best Practices


A job must be idempotent in nature
A job must do only the required operation and should not do anything else, examples

managing job queue for itself
introducing any cache or dealing with data persistence


A job should have clear ways of accepting inputs, examples

arguments

file based
database based


scheduled time
should have clear ways of providing outputs, examples

execution status
errors if any
output
total number to attempts to be made in case of errors


General Coding Practices


Each component, service, controller, action, method or any part of the system is suppose to perform a single job as per the name given to it, and hence it must focus on doing that job only nothing else by following single-responsibility-rule


A function should not have >50* lines of code


A file should not have >200* lines of code


All possible code smells should be avoided, kindly go through these links to know more about them

[http://en.wikipedia.org/wiki/Code_smell]
[http://www.codinghorror.com/blog/2006/05/code-smells.html]
[http://stackoverflow.com/questions/tagged/code-smell+java]