This is a rough document prepared for the database design and architecture-level decisions to be made in an area where one wants to consider MongoDB and web services.
The tools would mainly include MongoDB, RESTful resources, background jobs, and possibly a queuing engine (RabbitMQ) to provide unit inputs to the background jobs. After measurements, caching layers like ElasticSearch can be introduced as required. MySQL can be considered in a few cases to support relational data storage as per need.
RESTful services and background jobs can be implemented using the best frameworks available in Java. Alternatively, nowadays service developers talk about the wonders of the Go language, hence it would be great if Go is considered for writing the service layer. There is a very nice article about REST services in Go worth reading.
Conventions make life easier, hence we will also follow a number of conventions, as below.
All data must be sent to the device (app client) over REST only, following RESTful routes. All REST services must follow the routing conventions below.
example resource:
/users/ method="GET" # :controller => 'users', :action => 'index'
/users/:id method="GET" # :controller => 'users', :action => 'show'
/users method="POST" # :controller => 'users', :action => 'create'
/users/signUp method="POST" # :controller => 'users', :action => 'signUp'
/users/:id method="PUT" # :controller => 'users', :action => 'update'
/users/:id method="DELETE" # :controller => 'users', :action => 'destroy'
HTTP methods
- Any action that changes the value of existing attribute(s) must be a PUT
- Any action that adds or creates a new instance of any resource must be a POST
- Any action that deletes one or more resource(s) must be a DELETE
- Any action that does not change the state of any object but simply accesses the resource(s) or one or more of their attributes must be a GET
- Only GET requests are made to fetch data from the service; no non-GET call is made to fetch any data
- All non-GET requests must return 200/201 or an error status
- HEAD can be used to check the status of any resource; it can be used to return 300
Resource(s)'s Behaviors
A resource or resources could have behaviors like show, count, process, perform. These are different from resource attributes. We could consider the following mapping
"resources" like name of the table
"resource" a single instance or one row of the table
"attribute" one column of the table
"behavior" one operation that could happen one a resource or a multiple
Behaviors can be classified in two ways
- Class behavior
  /resources/staticBehavior
  Static behaviors are bound to multiple instances of a resource, or they are class-level behaviors; they cannot be applied to any single instance as such, e.g.
  /users/count
- Instance behavior
  /resources/:resourceId/instanceBehavior
  These are applied to an individual resource, e.g.
  /users/102/block
Note that "resource(s)" are nouns, and "behaviors" are verbs.
Kindly read more about RESTful resources here
- http://microformats.org/wiki/rest/urls
- http://stackoverflow.com/questions/207477/restful-url-design-for-search
- http://stackoverflow.com/questions/1619152/how-to-create-rest-urls-without-verbs
Resource Naming
A resource name would appear in routes, in the database, and on the application UI. Let's follow these conventions:
"resource" is a singular entity representing one instance of a resource-class
"resources" is the plural, representing multiple instances of a resource-class, OR the resource-class itself
e.g. user and users, topicCategory and topicCategories
Finally, it is important to name resources as they appear on screen to the end user.
All service APIs can be categorized as
- CRUD API
  Supporting all operations per resource. A CRUD service must focus only on CRUD operations and should not provide any other API.
- Screen based API
  Supporting the data required per screen. This is the collection or grouping of APIs per screen, defined as per need or requirement. For instance, the contacts screen would need the following APIs
  - GET /users/:userId/contacts
  - GET /users/:userId/contacts/appUsers
  - POST /users/:userId/follow/:userIdToFollow
  - POST /users/:userId/unfollow/:userIdToUnfollow
  - POST /users/:userId/contacts
- READ or WRITE API
  Segregating read and write APIs, i.e. GET and non-GET methods.
  WRITE APIs must be monitored for slow responses; such APIs can trigger a background job for the data processing they do (as sketched below).
  READ APIs taking longer can be identified as caching opportunities.
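The following is a minimal sketch, assuming a Go service layer on the standard net/http package, of how slow WRITE and READ calls could be flagged; the 500ms threshold is an assumed placeholder, to be replaced by measured values. It would be wrapped around the mux, e.g. http.ListenAndServe(":8080", middleware.Timed(mux)), alongside the nginx-log analysis mentioned later.

package middleware

import (
	"log"
	"net/http"
	"time"
)

// Timed logs any request slower than an assumed 500ms threshold. Slow
// non-GET (WRITE) calls are candidates for background jobs; slow GET (READ)
// calls are candidates for caching.
func Timed(next http.Handler) http.Handler {
	const threshold = 500 * time.Millisecond
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		if elapsed := time.Since(start); elapsed > threshold {
			kind := "READ"
			if r.Method != http.MethodGet {
				kind = "WRITE"
			}
			log.Printf("slow %s API: %s %s took %s", kind, r.Method, r.URL.Path, elapsed)
		}
	})
}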
These are a few example data sets required to be fetched through the REST services. Here an attempt is made to identify what data is needed per screen, and then one or more API calls are made to get the required data. These would help us define APIs apart from the CRUD operations. Following are the screens, for example
- SPLASH
  GET /api/v1/assets/forSplashScreen
  This would return a list of assets for the splash screen. asset is one of the resource types.
  Please assume that all the service calls are prefixed with /api/v1/.
- SignUp
  POST /users/signUp
  This would post various user attributes to create a new user. The service would first validate all the attributes, mainly uniqueness and whether the user already exists, and would respond accordingly (see the handler sketch after this list).
- Setup Mood
  GET /moods
  POST /users/765765/setupMood
- Setup Topic
  GET /topicCategories
- User Contacts
  GET /users/6576576/contacts
  This would get all the user's contacts
  GET /users/6576576/contacts/appUsers
  This would get the user's contacts who are using the app
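A minimal sketch of the SignUp API above, assuming a Go handler with the official MongoDB Go driver and bcrypt for password hashing; the request field names and the users/userAccounts collection split follow the data model described later, and the exact attribute set is an assumption:

package main

import (
	"encoding/json"
	"net/http"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"golang.org/x/crypto/bcrypt"
)

// Assumed request body for POST /api/v1/users/signUp.
type signUpRequest struct {
	FirstName   string `json:"firstName"`
	Email       string `json:"email"`
	CountryCode string `json:"countryCode"`
	PhoneNumber string `json:"phoneNumber"`
	Password    string `json:"password"`
}

// signUpHandler validates the attributes (mainly uniqueness / existing user)
// and creates documents in the users and userAccounts collections.
func signUpHandler(users, accounts *mongo.Collection) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var req signUpRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Email == "" || req.Password == "" {
			http.Error(w, "invalid attributes", http.StatusBadRequest)
			return
		}
		// Reject duplicates before creating anything.
		if n, err := users.CountDocuments(r.Context(), bson.M{"email": req.Email}); err != nil || n > 0 {
			http.Error(w, "user already exists", http.StatusConflict)
			return
		}
		res, err := users.InsertOne(r.Context(), bson.M{
			"firstName":   req.FirstName,
			"email":       req.Email,
			"countryCode": req.CountryCode,
			"phoneNumber": req.PhoneNumber,
		})
		if err != nil {
			http.Error(w, "could not create user", http.StatusInternalServerError)
			return
		}
		// Credentials live in the separate userAccounts collection.
		hash, _ := bcrypt.GenerateFromPassword([]byte(req.Password), bcrypt.DefaultCost)
		_, _ = accounts.InsertOne(r.Context(), bson.M{
			"userId":            res.InsertedID,
			"encryptedPassword": string(hash),
		})
		w.WriteHeader(http.StatusCreated)
	}
}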
Most of the data can be stored in MongoDB alone. The only reason to store data in MySQL in a few cases might be to use one or more relational-DBMS features. MongoDB would indeed scale better than MySQL, and it is a better choice among other NoSQL databases like CouchDB and Cassandra.
MongoDB is
- a document-based NoSQL db; it is a non-relational form of storing data
- highly scalable, hence for most of the queries it would perform better than expected
- it allows querying over numbers, strings, and ranges
- it supports indexes
The data modeling with MongoDB mainly depends on application usage. Take user data, for instance. A user has several attributes, as listed below, including firstName, password and more. Even though we want each of our service calls to respond fast, there are several service calls and a few of them will be hit often, even very often. For example, the user password verification call will not be made often, whereas user info (firstName, lastName, contacts etc.) will be accessed much more often. The user attributes that get accessed and/or queried together often can be clubbed together in a single collection.
users attributes, grouped into collections:

users collection
- userId (_id)
- firstName
- lastName
- email
- photoUrl
- description
- isBlocked
- isDeleted
- countryCode
- phoneNumber
- handle

userDevices collection
- userId
- deviceId
- phoneModel
- vendor
- osVersion
- isInUse

userLastSeen collection
- userId (_id)
- lastSeenFromIP
- geoLocation
- lastSeenInAt

userAccounts collection
- userId
- confirmationToken
- confirmedAt
- confirmationSentAt
- encryptedPassword
- salt
- resetPasswordToken
- resetPasswordSentAt
All of the above user attributes can be divided into MongoDB collections, as shown above with users and userAccounts. The userAccounts attributes are required only at the time of signup, sign-in, forgot password and account confirmation, whereas the users attributes are required most of the time by the application.
Again, we may further divide the users attributes into users and userProfiles, since the user's description is not needed all the time; this would reduce data transfer. userProfiles will be queried only when a user sees or edits her own profile.
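A minimal sketch of this collection split, assuming the service layer is written in Go with the official MongoDB Go driver (go.mongodb.org/mongo-driver); the short bson field names are illustrative assumptions, in the spirit of the field-naming practice listed later:

package model

import "go.mongodb.org/mongo-driver/bson/primitive"

// users collection: attributes read on almost every screen.
type User struct {
	ID          primitive.ObjectID `bson:"_id,omitempty"`
	FirstName   string             `bson:"fn"`
	LastName    string             `bson:"ln"`
	Email       string             `bson:"em"`
	PhotoURL    string             `bson:"ph"`
	IsBlocked   bool               `bson:"blk"`
	IsDeleted   bool               `bson:"del"`
	CountryCode string             `bson:"cc"`
	PhoneNumber string             `bson:"pn"`
	Handle      string             `bson:"h"`
}

// userProfiles collection: rarely needed attributes such as description.
type UserProfile struct {
	UserID      primitive.ObjectID `bson:"_id"`
	Description string             `bson:"desc"`
}

// userAccounts collection: only touched during signup, sign-in, password
// reset and account confirmation.
type UserAccount struct {
	UserID             primitive.ObjectID `bson:"_id"`
	EncryptedPassword  string             `bson:"pwd"`
	Salt               string             `bson:"salt"`
	ConfirmationToken  string             `bson:"ctok,omitempty"`
	ResetPasswordToken string             `bson:"rtok,omitempty"`
}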
So the important notes here are
- Club resource attributes together in a single MongoDB document when they are queried together or required together for any screen or any process
- Minimize data transfer between the service and the device as much as possible; this is anyway the thumb rule at any level
Ideally, user attributes can be cached by the application until they are marked dirty or by using an expiry tag. SDWebImage can be used by the application to manage the on-device cache.
The userLastSeen collection is a write-heavy collection, hence it requires no indexes and it can be put in a different database, since write operations take a database-level lock (see the sketch below).
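A minimal sketch, assuming the Go MongoDB driver, of keeping userLastSeen in its own database; the database name "tracking" and the upsert-per-user shape are assumptions:

package tracking

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// RecordLastSeen upserts one document per user into the write-heavy
// userLastSeen collection, kept in a separate "tracking" database so its
// constant writes do not contend with the main application database.
// No secondary indexes are created on this collection.
func RecordLastSeen(ctx context.Context, client *mongo.Client, userID int64, ip, geo string) error {
	lastSeen := client.Database("tracking").Collection("userLastSeen")
	_, err := lastSeen.UpdateOne(ctx,
		bson.M{"_id": userID},
		bson.M{"$set": bson.M{
			"lastSeenFromIP": ip,
			"geoLocation":    geo,
			"lastSeenInAt":   time.Now(),
		}},
		options.Update().SetUpsert(true),
	)
	return err
}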
Following is the authentications collection; it is used to store user data when a user connects or registers using OAuth-based options like Facebook/Google/Twitter.
authentications
- userId
- provider ENUM FB/G/T
- uid
- data
- authError
- lastSyncedContactsAt
On OAuth-based registration screens, the service needs to query multiple collections, including users, userAccounts and authentications.
contacts
- userId
- contactId
- isFollowing
The application needs to show a contacts screen where all of the user's contacts who are using the app are listed, with options to (un)follow them.
GET /users/982173/contacts/appUsers
In this case, the API would finally return the following
Output: a list of contacts (app users) with their mini-profile details, including name, photo_url etc. Here it mainly needs to make the following queries
- Get the contacts where userId is 982173
- Filter those contacts for users seen during the last 6 months
  - (optional) further filter contacts who have synced their contacts using one of FB/G/T during the last 6 months
- Get the mini-profile attributes from the users collection for the filtered contacts
Again, it can be decided not to re-run the above queries for a period of, say, 24 hours (configurable) and to use the cached result instead. In this case one more collection is needed:
cachedAppContacts
- userId(_id)
- userContacts {embedded contacts list ready to be sent}
Every time, this collection will be checked first; if no entry exists, the above queries are run and the result is written to the cache (with TTL = 24 hrs). The cached result is used for the next 24 hours, and MongoDB will auto-erase it after 24 hours. A separate database can be created for such cacheResults collections. It has been observed that in such cases MongoDB can perform better than a dedicated caching engine like memcached or ElasticSearch; this can be verified by measurements.
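A minimal sketch of this read-through cache, assuming the Go MongoDB driver; buildAppContacts stands in for the three queries listed above and is passed in as an assumed callback:

package cache

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// GetAppContacts answers GET /users/:userId/contacts/appUsers from the
// cachedAppContacts collection, falling back to the real queries on a miss.
func GetAppContacts(ctx context.Context, client *mongo.Client, userID int64,
	buildAppContacts func(context.Context, int64) ([]bson.M, error)) ([]bson.M, error) {

	cached := client.Database("cacheResults").Collection("cachedAppContacts")

	// TTL index (normally created once at startup): MongoDB auto-erases a
	// cached document roughly 24 hours after its createdAt timestamp.
	if _, err := cached.Indexes().CreateOne(ctx, mongo.IndexModel{
		Keys:    bson.D{{Key: "createdAt", Value: 1}},
		Options: options.Index().SetExpireAfterSeconds(24 * 3600),
	}); err != nil {
		return nil, err
	}

	var doc struct {
		UserContacts []bson.M `bson:"userContacts"`
	}
	err := cached.FindOne(ctx, bson.M{"_id": userID}).Decode(&doc)
	if err == nil {
		return doc.UserContacts, nil // cache hit: reuse for up to 24 hours
	}
	if err != mongo.ErrNoDocuments {
		return nil, err
	}

	// Cache miss: run the real queries and store the ready-to-send list.
	contacts, err := buildAppContacts(ctx, userID)
	if err != nil {
		return nil, err
	}
	_, _ = cached.InsertOne(ctx, bson.M{
		"_id":          userID,
		"userContacts": contacts,
		"createdAt":    time.Now(),
	})
	return contacts, nil
}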
Next, the same screen could fire (un)follow requests such as
POST /users/65761219/follow/78979821
This request will create/update a record in the contacts collection as
{
userId: 65761219,
contactId: 78979821,
isFollowing: true
}
POST /users/65761219/unfollow/78979821
This request will create/update a record in the contacts collection as
{
userId: 65761219,
contactId: 78979821,
isFollowing: false
}
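A minimal sketch of this (un)follow write, assuming the Go MongoDB driver; the upsert keeps the operation idempotent, so repeating the same request leaves the contacts document unchanged:

package contacts

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// SetFollowing backs both routes:
//   POST /users/:userId/follow/:userIdToFollow     -> SetFollowing(..., true)
//   POST /users/:userId/unfollow/:userIdToUnfollow -> SetFollowing(..., false)
func SetFollowing(ctx context.Context, contacts *mongo.Collection, userID, contactID int64, following bool) error {
	_, err := contacts.UpdateOne(ctx,
		bson.M{"userId": userID, "contactId": contactID},
		bson.M{"$set": bson.M{"isFollowing": following}},
		options.Update().SetUpsert(true), // create the record if it does not exist yet
	)
	return err
}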
The following POST call would initially populate the contacts collection
POST /users/7867861/contacts => payload: a list of contact numbers
This call would perform the following activities
- Find or create users by contact numbers (countryCode+phoneNumber)
- Find or create contacts for all the records found in step 1 above, for the currentUser
If this processing is observed to take longer, a background job can be defined for it, so the API can quickly return a 200 status. Ideally it should not take that long with MongoDB. However, if this call gets a high hit count, it would be good practice to turn it into a background job (see the enqueue sketch below).
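A minimal sketch of handing this work to a background job through RabbitMQ, assuming the Go client github.com/rabbitmq/amqp091-go; the queue name syncContacts and the payload shape are assumptions:

package jobs

import (
	"context"
	"encoding/json"

	amqp "github.com/rabbitmq/amqp091-go"
)

// SyncContactsJob is one unit of work for the background job.
type SyncContactsJob struct {
	UserID         int64    `json:"userId"`
	ContactNumbers []string `json:"contactNumbers"`
}

// EnqueueSyncContacts publishes the payload of POST /users/:userId/contacts
// to the syncContacts queue so the HTTP handler can return 200 immediately
// while a worker does the find-or-create processing.
func EnqueueSyncContacts(ctx context.Context, ch *amqp.Channel, job SyncContactsJob) error {
	if _, err := ch.QueueDeclare("syncContacts", true, false, false, false, nil); err != nil {
		return err
	}
	body, err := json.Marshal(job)
	if err != nil {
		return err
	}
	return ch.PublishWithContext(ctx, "", "syncContacts", false, false, amqp.Publishing{
		ContentType:  "application/json",
		DeliveryMode: amqp.Persistent, // survive a broker restart
		Body:         body,
	})
}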
Following are a few important best practices one must consider while working with MongoDB.
- Once a MongoDB collection is defined, it is better to keep a fixed schema for it, meaning all documents in the collection have the same size: avoid NULL values and use defaults for them, or at least ensure that the collections queried most often leave very little scope for NULL-valued attributes. Such fixed-size collections support very high throughput for READ/WRITE operations and are called Capped Collections. More about them here. Capped Collections are the first choice where recently inserted data gets read often.
- When it comes to scaling and a compromise has to be made between data duplication and data transfer, accept data duplication and ensure less data gets transferred between client and server.
- Define indexes on collections with high read + low write; do not have indexes on collections with high write + low read. Please read more here.
- Avoid full-document updates and update only the required fields; make maximum use of MongoDB operators and modifiers, e.g. to increment a counter field use $inc (see the sketch after this list).
- Use short names for fields; long names do eat memory.
- If a field is stored as a Float, do not store an Integer value in it; always store it in Float form. Any similar change to the document schema would impact the document space and would result in lower performance.
- Avoid changes to an existing collection's schema; if required, define a new collection with the fixed schema and migrate the data.
- Make use of the _id field; it is the reserved primary key column and gets a default unique value unless specified. It can be set explicitly as long as its constraints are met.
- Use multiple databases. Database-level locking lets you split workloads across databases to avoid contention, e.g. you could separate high-throughput logging from an authentication database. Identify MostRead, AvgRead and MostWrite collections and keep them in separate databases.
- List all GET requests ordered by hit count and average response time from the nginx access log. Always monitor the top GET calls with maximum hit count and average response time on a weekly basis; set an email notification for such GET calls that fires once a week. This requires scanning the nginx access log weekly.
- Caching may need to be done per user.
- Correctly identify when to erase the cache.
- If any data collection (data being served by any API) is taking longer, move the data from MySQL to MongoDB, then from MongoDB to ElasticSearch, or serve it as a static JSON file directly from nginx.
- Identify the most read-only MySQL tables from the beginning; knowing these would allow us to cache their SELECT queries efficiently using the MySQL query cache.
- GET requests with the highest hit counts must have their data in MongoDB, where it can be queried based on parameters; serving it from MySQL would take more time compared to MongoDB.
- Sometimes data can be duplicated between MySQL and MongoDB, with the MongoDB copy kept query-ready; that is, all data attributes are stored and processed using MySQL and, once processed, the selected attributes are copied to MongoDB.
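A minimal sketch of two of the practices above, assuming the Go MongoDB driver: a partial update with $inc instead of a full-document update, and an index created only on a high-read/low-write lookup field. The collection and field names are assumptions:

package practices

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// IncrementFollowerCount bumps a counter with $inc instead of rewriting the
// whole document.
func IncrementFollowerCount(ctx context.Context, users *mongo.Collection, userID int64) error {
	_, err := users.UpdateOne(ctx,
		bson.M{"_id": userID},
		bson.M{"$inc": bson.M{"followerCount": 1}},
	)
	return err
}

// EnsureHandleIndex indexes users.handle, a high-read/low-write lookup field;
// write-heavy collections would deliberately get no such index.
func EnsureHandleIndex(ctx context.Context, users *mongo.Collection) error {
	_, err := users.Indexes().CreateOne(ctx, mongo.IndexModel{
		Keys: bson.D{{Key: "handle", Value: 1}},
	})
	return err
}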
This section defines background jobs, their behaviors, and the best practices to be followed around them.
Example background jobs
- Update user social data
- Update user device data
- Update user's relevance index
- Process Activity
- A job must be idempotent in nature (see the worker sketch after this list)
- A job must do only the required operation and should not do anything else; examples of things to avoid:
  - managing a job queue for itself
  - introducing any cache or dealing with data persistence
- A job should have clear ways of accepting inputs, examples
  - arguments
  - file based
  - database based
  - scheduled time
- A job should have clear ways of providing outputs, examples
  - execution status
  - errors, if any
  - output
  - total number of attempts to be made in case of errors
  - arguments
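A minimal sketch of a worker following these rules, assuming the amqp091-go client and the syncContacts queue and SyncContactsJob payload from the enqueue sketch earlier; the job is idempotent as long as the supplied process function is (e.g. find-or-create), takes its input only from the message arguments, and reports its outcome through ack/nack and the log:

package jobs

import (
	"context"
	"encoding/json"
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

// RunSyncContactsWorker consumes syncContacts messages one by one. The input
// comes only from the message arguments; the outcome is reported via ack/nack
// and the log, and retries happen by requeueing on error.
func RunSyncContactsWorker(ctx context.Context, ch *amqp.Channel,
	process func(context.Context, SyncContactsJob) error) error {

	msgs, err := ch.Consume("syncContacts", "", false, false, false, false, nil)
	if err != nil {
		return err
	}
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case d, ok := <-msgs:
			if !ok {
				return nil
			}
			var job SyncContactsJob
			if err := json.Unmarshal(d.Body, &job); err != nil {
				_ = d.Nack(false, false) // bad payload: drop, do not requeue
				continue
			}
			if err := process(ctx, job); err != nil {
				log.Printf("syncContacts failed for user %d: %v", job.UserID, err)
				_ = d.Nack(false, true) // requeue for another attempt
				continue
			}
			_ = d.Ack(false)
		}
	}
}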
- Each component, service, controller, action, method or any other part of the system is supposed to perform a single job as per the name given to it, and hence it must focus on doing that job only and nothing else, following the single-responsibility rule
- A function should not have >50* lines of code
- A file should not have >200* lines of code
- All possible code smells should be avoided; kindly go through these links to know more about them