
2021 SDC Engineering Journal

This gist details the decisions I made, the tests I conducted, the observations I recorded, and the directions I took while building out a back-end API service for the new Atelier retail app in the SDC (System Design Capstone) project at Hack Reactor.

Table of Contents


The following is the template and some additional prompts that will be used for the journal moving forward.


Goals & Motivations

Every action you take should be driven by some challenge you are facing. Many times this will simply be a requirement of the project. However, you should be able to frame each challenge in your own terms. Own the motivations in the context of your own project.

Actions Taken

This portion should be the most detailed. Make sure to be technically precise and remove any ambiguity related to the action you are taking.

Achievements

Results Observed

Results Impacts

Reflections

Daily Summary

First Journal Page

2021-08-26

Database Selection: SQL or NoSQL?

To start off our project, we had to decide which database would be the primary vs. secondary database and how to actually use both.

I chose to have MongoDB as the primary database (i.e., the one typically encountered first), as it is faster for simple read-write actions. However, I am setting it up to likely serve as a caching layer, with all of the data ultimately stored & reconciled in Postgres as the source of truth.

Postgres is generally faster for insertions/updates and better at the more complex queries made possible by table joins, so such queries will be forwarded directly to it.

Actions where Postgres is faster will be directed straight to Postgres, and vice versa for Mongo. However, Mongo will not hold all of the data at all times; if it is missing requested data, it will fetch it from Postgres. Data written to Mongo will be pushed to Postgres on a separate thread while the original call returns.
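A minimal sketch of this read-through/write-behind pattern, assuming hypothetical `mongoGet`/`mongoSet` (Mongoose) and `pgGet`/`syncToPg` (Sequelize) wrappers:

```js
// mongoGet/mongoSet and pgGet/syncToPg are hypothetical wrappers
// around the Mongoose and Sequelize models.
async function getReview(id) {
  const cached = await mongoGet(id);   // check the Mongo cache first
  if (cached) return cached;
  const row = await pgGet(id);         // fall back to the source of truth
  await mongoSet(id, row);             // backfill the cache for next time
  return row;
}

async function createReview(data) {
  const doc = await mongoSet(data.id, data); // fast write to Mongo
  setImmediate(() => syncToPg(data));        // push to Postgres off the critical path
  return doc;
}
```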

I don't know if I can/will have any metrics monitoring, but if one system is slowing down under high request rates, I can then switch to the other system to balance loads temporarily.

ORM Considerations

In addition to this database selection, I have decided to use ORM (object-relational mapping) frameworks for each one (Mongoose for MongoDB, and Sequelize for Postgres). At this stage I am not sure whether using ORMs will impose a performance penalty, but it seemed like a good way to design a server to be agnostic about which databases it is using, abstracting out the database implementation details. Not only can other databases be swapped in for primary/secondary in the future, but if they use the same ORM (e.g. switching from Postgres to MySQL), for the most part only connection details need to be changed in the higher-level database controller code in the server.
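As a sketch of that point, with Sequelize the engine swap is (mostly) confined to the connection options; the connection values here are illustrative:

```js
const { Sequelize } = require('sequelize');

// Only these connection details change when swapping engines;
// the model and query code built on top stays the same.
const sequelize = new Sequelize('atelier', 'db_user', 'db_password', {
  host: 'localhost',
  dialect: 'postgres', // e.g. change to 'mysql' to switch databases
});
```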

Basic Design & Setup of Postgres

The first step was to design my SQL relations and code the basic Postgres database.

Design

Front End App Component

I have chosen to make the backend API routes for the ratings and reviews component.

Entity Diagram

Constructing the Database with Sequelize


TODO: Discuss handling the race condition in Postgres when adding characteristics.

ETL: Extract, Transform, Load

Since the end goal of our project is to stress test the databases and server routes in a variety of ways, and at a variety of levels, we need a database populated with a dataset scaled to simulate real-world conditions. For the ratings and reviews component, I was given 4 CSV files comprising data dumped from a third-party source, with file sizes and entry counts shown below:

TODO: Screenshot of spreadsheet breakdown

In this state of the project, I had to work through an ETL (Extract, Transform, Load) process in order to populate my database, in the following ways:

  1. Extract the CSV data into a workable format, such as JSON/JavaScript objects.
  2. Transform the data from the original data source's arrangement to fit my database, which has a different structure. This includes re-deriving metadata that the APIs will call for but that is not supplied in the CSV files.
  3. Load the transformed data into my database.

Although this process was to be done for each database chosen, to me it made sense to do it fully only for the Postgres database, as it serves as the ultimate 'source of truth'. Conversion methods later written to push data back and forth between the two databases could then be used to generate whatever is needed in MongoDB, either on the fly or for pre-caching data.

Extract

Initially, I performed ETL on a stream reading from a CSV file to JSON to the database. Although this worked fine for small sets of data, the rate was too much for Sequelize when processing large files, and the process quickly stalled out.

I broke up the process to better calibrate the load on Sequelize. The stream was mostly used to build up a cache system of data & requests that could then be held in memory and emptied at a slower rate. This system consisted of asynchronous caches (with up to 100 running at a time) that each executed a synchronous queue (up to 10,000 entries each).

In the event that this isn't enough, I can always add some system monitoring or another algorithm in either cache holder that could adjust the rate at which queued callbacks are invoked. I already have a placeholder method in use, so stubbing in such behavior would be easy with the existing system.
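A simplified, single-queue sketch of this pattern using csvtojson (the `Review` model and file name are hypothetical; returning a promise from the subscriber is what throttles the stream):

```js
const fs = require('fs');
const csv = require('csvtojson');

const QUEUE_SIZE = 10000; // mirrors the synchronous queue size above
let queue = [];

csv()
  .fromStream(fs.createReadStream('reviews.csv'))
  .subscribe(
    (row) => {
      queue.push(row);
      if (queue.length < QUEUE_SIZE) return; // keep filling the queue
      const batch = queue;
      queue = [];
      // Returning a promise backpressures the stream until the batch
      // drains into the database at a rate Sequelize can handle.
      return Review.bulkCreate(batch); // hypothetical Sequelize model
    },
    (err) => console.error(err),
    () => Review.bulkCreate(queue) // flush whatever is left at the end
  );
```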

TODO: Chunking streams when converting CSV to JSON to DB:
  • Keyang/node-csvtojson#135
  • https://nodejs.org/api/stream.html#stream_simplified_construction
  • https://nodesource.com/blog/understanding-streams-in-nodejs/
  • https://medium.com/developers-arena/streams-piping-and-their-error-handling-in-nodejs-c3fd818530b6

TODO: CSV to JSON:
  • https://medium.com/nonstopio/import-csv-file-data-into-mongodb-database-using-node-js-bb21afdebc31
  • https://stackoverflow.com/questions/16831250/how-to-convert-csv-to-json-in-node-js
  • https://www.npmjs.com/package/csvtojson#api

Transform

Load

Shared Testing Concerns Between DBs

Setting Env Variables

https://stackoverflow.com/questions/11104028/why-is-process-env-node-env-undefined

```js
console.log('Environment:', process.env.NODE_ENV);
console.log(process.env.NODE_ENV === 'development'); // false
console.log(`Expect ${'development'.length}:`, process.env.NODE_ENV.length); // 11 (including a space at the end)
```

"envTestDev": "export NODE_ENV='development' && node envTest.js", "envTestTest": "export NODE_ENV='test' && node envTest.js", "envTestProd": "export NODE_ENV='production' && node envTest.js",

Screen Shot 2021-09-17 at 11 43 07 PM

DB Testing of Postgres w/ Sequelize & Jest

Controller Design & Testing Between MongoDB & Postgres DB

TODO: Discuss modules and injection into anonymous functions.

Controller Testing Between MongoDB & Postgres DB

TODO: Discuss integration testing.

Testing Full API Route with Postgres

Testing Full API Route with MongoDB

DBMS Performance and Service Stress Testing

Goals & Motivations

Now that the database is up and running, I need to get some basic metrics of how fast my read queries are.

Checking my test database will be helpful for establishing a baseline metric for database speed in its most efficient case: a simple operation, a single request, on a small database of a consistent and predictable form.

I also need some basic metrics of how fast my read queries are in a situation closer to real life, so I will run the same tests on my production database. This will be helpful for establishing a baseline metric for database speed in the same simple case of a single request, but now on a large database with a less predictable form.

Databases Tested

  1. Test database (small # of entries, known state, faster runs)
  2. Production database (large # of entries, unknown state, slower runs)

Query times are checked via the server APIs, and will be re-run after each improvement is made so as to see the effects of the various improvements.

Categories Tested

For each database, I will look up a few records via API paths that tap into each of the following categories:

  • Review
  • Product Metadata
  • Product Reviews

For the test database, I might as well get the most efficient response to establish a best-case bound, so I will look up the earliest few records for each of the categories.

For the production-scale database, I might as well get the least efficient response to establish a worst-case bound, so I will look up how many records there are in a table in order to then search for the last few records for each of the categories.

  • Review: 5,765,293 records
  • Product: 1,000,011 records
  • Metadata: records

Note that I am only including tables that are likely to be used in the API. Basic tables such as products, users, characteristics, etc. are likely to only be obtained IN COMBINATION with other tables.

DBMS Performance Testing

Goals & Motivations

Now that the databases are fully formed and connected to the server with API routes, it is time to test the speed of various database queries in isolation.

Actions Taken

For each API route, a request was made roughly 10 times using Postman, taking the total time from request to response. These times were recorded and averaged.

The following images show sample Postman output for the API paths tested on the test database:

Screen Shot 2021-10-11 at 4 37 42 PM

Screen Shot 2021-10-11 at 4 37 25 PM

Screen Shot 2021-10-11 at 4 36 22 PM

Review Routes

Screen Shot 2021-10-11 at 4 46 25 PM

Screen Shot 2021-10-11 at 4 46 44 PM

Product Reviews Routes

Screen Shot 2021-10-11 at 4 49 29 PM

Screen Shot 2021-10-11 at 4 49 44 PM

Product Reviews Metadata Routes

Service Stress Testing

Goals & Motivations

Another testing goal is local stress testing of the databases to use as a quantitative basis for improvements. These quantitative metrics were RPS (requests per second), latency, and error rate.

Actions Taken

K6 was chosen as my testing framework for stress testing the database. I first set it up and learned the ins and outs of the system by running and calibrating it on my test database. This seemed like a good idea, as I had fewer variables to consider and could focus purely on the inputs and outputs of all of my database route paths. I could test reads, writes, and updates without worrying about contaminating my larger database, and iterations while experimenting with the system were fast. Additionally, since my test database is created fresh and populated with seed data automatically before any test, it is easy to automate this and test predictable database states.

Tests were done with a constant request load sustained over time, as well as a ramping request load rate. The targeted test metrics are as follows:

  1. Constant Request Rates (held for 1 min) TBD (image of test config)

  2. Ramping Request Rates (each held for 1 min) TBD (image of test config)

For each type of loading, the service was tested for the following cases, if possible (a k6 script sketch follows the list):

  1. 1 rps
  2. 10 rps
  3. 100 rps
  4. 1000 rps
  5. 10,000 rps
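A minimal k6 sketch of the constant-load case (the route and port are illustrative; the ramping variant swaps in the `ramping-arrival-rate` executor):

```js
import http from 'k6/http';

export const options = {
  scenarios: {
    constant_load: {
      executor: 'constant-arrival-rate',
      rate: 100,            // requests per second
      timeUnit: '1s',
      duration: '1m',       // held for 1 min, per the plan above
      preAllocatedVUs: 200, // VUs reserved to sustain the rate
    },
  },
};

export default function () {
  http.get('http://localhost:3000/reviews/1'); // hypothetical API route
}
```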

Setting Up the Postgres Production Database

In order to maintain good practices, I am still avoiding touching the 'production' database created from ETL, and instead am cloning the database through dump and restore commands.

Actions Taken

The development database generated from ETL was copied to a new database by running the following dump command:

```
pg_dump -U postgres -d sourcedb -f sourcedb.sql
```

and then the following restore command to generate the database to stress test on a fresh instance:

```
psql -U postgres -d targetdb -f sourcedb.sql
```

It is worth noting that while ETL took a VERY long time to run, these commands, which tap into native methods and lower-level implementations, were much faster. For example, to copy the entire 5.1 GB of CSV data:

  • Dump Time: 31 seconds
  • *.SQL File Size: 3.7 GB
  • Load Time: 6 min

Once this was completed, I could then take my existing testing setups and merely point them to the new database location to run the desired tests.

Achievements

A new database was deployed and tested, allowing me to not only get current metrics for a database, but have a nice A/B comparison between a smaller and larger database based on the same schema.

Postgres Read Queries - Performance and Stress Testing - Overall Results

Improvements Tested

I will run these tests after each improvement to the database is made in order to more easily see where and how much each improvement made a difference. The improvements I will make for Postgres are as follows:

  1. Indexing - Many queries use the public IDs to do database lookups. These are currently not indexed.
  2. Table joins - Current queries chain promises to find data in tables based on prior query lookup results; joins would consolidate these into single queries.
  3. Max connections - increase from the default 100 to 400.
  4. Shared buffers - increase from the default 128 MB to 250 MB.
  5. Pool max - increase from the default 5 to 400.

Goal from Improvements

The target read speed is for all read queries to be under 50 ms, and ideally as short as 10 ms. The service is expected to be able to reach 1000 rps with under a 1% error rate and under 2000 ms latency.

Overall Results Observed - Test Database

Screen Shot 2021-10-11 at 4 35 37 PM

IDs chosen are meant to include a range of results that may affect times, such as one-to-many and many-to-many relations (e.g., reviews with characteristics or photos). I chose IDs with a range of states.

The following are the queries and resulting execution times on the test-scale database initially and after each improvement:

The following are the initial stress test results on the test-scale service for constant load:

The following are the final stress test results on the test-scale service for constant load:

The following are the initial stress test results on the test-scale service for ramped load:

The following are the final stress test results on the test-scale service for ramped load:

Overall Results Observed - Production-Scale Database

Screen Shot 2021-10-11 at 7 18 58 PM

IDs chosen are meant to include a range of results that may affect times, such as one-to-many and many-to-many relations (e.g., reviews with characteristics or photos). In this 'worst case' set of tests, I aimed to have all IDs (from low to high within the total range in the DB) include table joins with BOTH at least one characteristic and at least one photo.


The following are the queries and resulting execution times on the production-scale database initially and after each improvement: DB Query Time - Summary - Scale

The following are the stress test results on the production-scale service for constant load comparing the initial to final states, with key improvement steps noted: K6_constant_summary

The following are the stress test results on the production-scale service for ramping load comparing the initial to final states, with key improvement steps noted: K6_ramping_summary

Results Impacts

Reflections

Indexing resulted in the greatest improvement, reducing query times by 55.94% to 99.87%. Table joins also helped, but to a lesser extent, reducing query times by up to 65.1%.

Increasing max connections improved the speed and reliability of queries doing less work and returning less data, but there were also more errors from connections being overwhelmed, which caused a higher failure rate and inconsistent results, as some queries took a long time or never finished when this occurred. Increasing the max buffer helped somewhat, but the results were still rather mixed. This shows the importance of carefully tuning the database configuration, as increasing capacities may not lead to the desired improvements.

Postgres Read Queries - Performance and Stress Testing - Initial Results

Results Observed - Test Database

The following are the queries and resulting execution times on the test-scale database initially: Screen Shot 2021-10-11 at 9 31 41 PM

The following are the initial stress test results on the test-scale service for constant load: k6_test_review_constant_summary k6_test_productReviews_constant_summary k6_test_productReviewMetadata_constant_summary

The following are the initial stress test results on the test-scale service for ramped load: k6_test_review_ramp_summary k6_test_productReviews_ramp_summary

Results Observed - Production-Scale Database

The following are the queries and resulting execution times on the production-scale database initially: Screen Shot 2021-10-11 at 9 31 51 PM

The following are the initial stress test results on the production-scale service for constant load: k6_scale_review_summary k6_scale_productReviews_summary k6_test_productReviewsMetadata_ramp_summary

The following are the initial stress test results on the production-scale service for ramped load: (Missed this before I moved on to indexing foreign keys. Oops!)

Reflections

Some interesting general notes: the first query always takes substantially longer than the rest (by about a factor of 5), as if the database needs to warm up. Additionally, if the same query is repeated within a short enough period of time, even with other queries in between, the time seems to drop in a converging pattern, as if some sort of caching system is coming into play. This is apparent when viewing each of the timed DB queries. These two factors (# of queries since the first one, # of identical queries in a row) have a MUCH larger effect than how high an ID number is queried within the database.

Also, when the number of entries in the database increases from a few dozen to millions, the slower queries end up flipping (as expected) to the ones that require more lookups and joins.

Additionally, what look like reasonable times on the test database end up clearly over the 10-50 ms target limit for queries on the full-scale database. This shows the importance of testing database design with realistically scaled data included, in addition to just the mechanical workings of the schema & queries.

Postgres Read Queries - Performance and Stress Testing - Indexing Results

Goals & Motivations

The first bit of low-hanging fruit for improving the database is to index strategically. The tables contain a public set of IDs which are NOT used as the primary keys, since they might change. Since the API relies on these public keys, the non-indexed keys are ALWAYS used as the starting point for any request, rather than the indexed primary keys.

Additionally, it turns out that although I took care to use primary keys as foreign key references within the database, Postgres DOES NOT index these foreign keys where they are used, only where they are defined. This means that any 1:m and n:m tables need their foreign keys indexed. This article explains the issue well.

Actions Taken

I indexed the following keys in the Postgres database, either in the table definitions (per the test database case) or by modifying the existing database (per the production database case, where the existing data must be preserved): Screen Shot 2021-10-11 at 9 34 12 PM
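A sketch of both approaches (the model, table, and column names here are illustrative, not my actual schema):

```js
const { DataTypes } = require('sequelize');

// In the table definition (test database case):
const Review = sequelize.define('review', {
  reviewId: { type: DataTypes.INTEGER, allowNull: false }, // public ID, not the PK
  productId: { type: DataTypes.INTEGER },                  // foreign key
}, {
  indexes: [
    { fields: ['reviewId'] },  // public key the API looks up by
    { fields: ['productId'] }, // FK on the referencing side, not auto-indexed
  ],
});

// Modifying the existing database (production case, data preserved):
sequelize.query(
  'CREATE INDEX IF NOT EXISTS reviews_product_id_idx ON reviews ("productId");'
);
```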

Results Observed

The following are the queries and resulting execution times on the test-scale database:

The following are the queries and resulting execution times on the production-scale database: DB Query Time - Production


The following are the stress test results on the test-scale service for constant load:

The following are the stress test results on the test-scale service for ramped load:

The following are the stress test results on the production-scale service for constant load:

The following are the stress test results on the production-scale service for ramped load:

Results Impacts

Reflections

Postgres Read Queries - Performance and Stress Testing - Table Joins Results

Goals & Motivations

Actions Taken

Results Observed

The following are the queries and resulting execution times on the test-scale database:

The following are the queries and resulting execution times on the production-scale database:


The following are the stress test results on the test-scale service for constant load:

The following are the stress test results on the test-scale service for ramped load:

The following are the stress test results on the production-scale service for constant load:

The following are the stress test results on the production-scale service for ramped load:

Results Impacts

Reflections

Postgres Read Queries - Performance and Stress Testing - Max Database Connections Results

The Postgres database on my local computer has a default max connections of 100. I chose to increase it initially to 400. It sounded like some calibration would be needed to find the optimal number of connections, so I decided to do the following:

  1. Run a constant load in K6 to note completion % and latency.
  2. Do this for a constant load of 1 min each at 1 rps, 10 rps, 100 rps & 1000 rps in order to see improvements and see a rounding of the 'elbow' in efficiency. This was first done on a single review for a faster/simpler case, and then re-run on a product reviews path for the maximum complexity of a read, since this involves many table joins across 1:m and n:m tables on many reviews.
  3. Once confidence was gained that 400 is a reasonable starting point, max connections would be increased by doubling, testing only the largest reached rps load on the product reviews path. A binary-search strategy will be used, where an iteration or two of these large increases are checked until improvement decreases or flatlines. Once this occurs, a max connections value between this state and the last best will be tested. This is repeated a couple of times until a satisfactory maximum is reached.

Max_connections_400_review

**Results of first K6 Run on Review Reads after Max Connections Increased to 400**

It was clear that, as expected, this change had no noticeable effect on latency at low request volumes, but once volume reached the prior levels that had crashed tests, there was a noticeable improvement, and a higher rps load could be reached.

It is also worth noting that there were more timeouts, which occurred earlier in the runtime; this makes sense if more requests are being made and another bottleneck further down the pipeline is being triggered.

Once I worked out that the ideal max_connections seemed to be about 400, I re-did the tests with pool max, which is a Sequelize setting. It was a bit unclear to me just how it differs, apart from perhaps throttling requests a bit by limiting the maximum number of connections that can be held per app instance.

The default pool max is 5, so for a large jump I first tried increasing it to match the current max_connections.
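As a sketch of where each knob lives (connection values are illustrative; the pool numbers mirror those discussed here):

```js
// Database side (postgresql.conf): max_connections = 400
// App side (Sequelize): the pool caps connections *per app instance* and
// queues further requests, rather than rejecting them outright.
const sequelize = new Sequelize('atelier', 'db_user', 'db_password', {
  host: 'localhost',
  dialect: 'postgres',
  pool: { max: 400, min: 0, acquire: 30000, idle: 10000 },
});
```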

Memory allocation https://www.postgresql.fastware.com/blog/back-to-basics-with-postgresql-memory-components

For the choice of 400 max connections:
https://wiki.postgresql.org/wiki/Number_Of_Database_Connections
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
'Generally, PostgreSQL on good hardware can support a few hundred connections. If you want to have thousands instead, you should consider using connection pooling software to reduce the connection overhead.'

https://dev.mysql.com/doc/refman/5.6/en/connection-interfaces.html
'Linux or Solaris should be able to support at least 500 to 1000 simultaneous connections routinely.'

Goals & Motivations

Actions Taken

Results Observed

The following are the queries and resulting execution times on the test-scale database:

The following are the queries and resulting execution times on the production-scale database:


The following are the stress test results on the test-scale service for constant load:

The following are the stress test results on the test-scale service for ramped load:

The following are the stress test results on the production-scale service for constant load:

The following are the stress test results on the production-scale service for ramped load:

Results Impacts

Reflections

Postgres Read Queries - Performance and Stress Testing - Shared Buffer Results

The default started at 128 MB.

For the choice of 250 MB (the AWS server has 1 GB RAM):
https://www.postgresql.org/docs/9.1/runtime-config-resource.html
'If you have a dedicated database server with 1GB or more of RAM, a reasonable starting value for shared_buffers is 25% of the memory in your system.'
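As a sketch, the change itself is a single line in postgresql.conf (followed by a Postgres restart), 250 MB being roughly 25% of the assumed 1 GB of server RAM:

```
shared_buffers = 250MB   # default was 128MB
```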

Goals & Motivations

Actions Taken

Results Observed

The following are the queries and resulting execution times on the test-scale database:

The following are the queries and resulting execution times on the production-scale database:


The following are the stress test results on the test-scale service for constant load:

The following are the stress test results on the test-scale service for ramped load:

The following are the stress test results on the production-scale service for constant load:

The following are the stress test results on the production-scale service for ramped load:

Results Impacts

Reflections

Postgres Read Queries - Performance and Stress Testing - Max Pool Connections Results

Same as max connection?

For the choice of 400 max pool:
https://betterprogramming.pub/how-to-avoid-bottlenecks-in-node-js-applications-8085d86b6b2e
'The downside of this approach is that when the connection pool limit is reached, any new request will have to wait for a connection to be released by the previous ones. This is why the pool size should be chosen carefully. If your data is not very large or you do not expect many simultaneous users, you might not encounter any problems. Otherwise, this may easily become the bottleneck of your application. ... The most important aspects to take into account in order to find the right value are the expected application load and concurrency level. ... By default, this pool is limited to 5 ... When max_connections is reached, the server rejects any other connection — as opposed to what happens at the application layer, where requests are queued. Finding a good value for max_connections is complex'


Effective cache size: 'Setting effective_cache_size to 1/2 of total memory would be a normal conservative setting.'

Goals & Motivations

Actions Taken

Results Observed

The following are the queries and resulting execution times on the test-scale database:

The following are the queries and resulting execution times on the production-scale database:


The following are the stress test results on the test-scale service for constant load:

The following are the stress test results on the test-scale service for ramped load:

The following are the stress test results on the production-scale service for constant load:

The following are the stress test results on the production-scale service for ramped load:

Results Impacts

Reflections

Server Deployment - Postgres

I had to decide if/how to ensure authentication, and the methodology for doing so. One option was whether to use the default global admin profile or to generate a new profile with the minimum necessary privileges. I chose to create a new profile, since customizations to it were less likely to cause system issues that are hard to reverse. Additionally, it provided another layer of security, as that user only had specific privileges to CRUD within the database. In the event that it was compromised, a hacker couldn't drop the database, create new tables, etc.

===

I needed to securely open access to the database running on my EC2 instance in order to run databases on separate servers from my central Express API app, without allowing anyone with the IP address access to database CRUD operations.

This was achieved by first defining a role/user that requires a password. Next, the user was granted only specific privileges on a specific database. This limited role would be the profile through which my database is controlled by my Express app. Finally, once this secure access was created, I edited the Postgres config file to allow calls from any IP address, but to require all calls to supply a password with md5 authentication.

This allowed me to securely access my database stored on a separate EC2 instance from my Express app, as a user with limited privileges and an associated encrypted password must be supplied in order to interact with the database. This is the first step in scaling my database horizontally by having multiple database instances running, rather than one on the same server as the Express app.
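A hedged sketch of those steps in SQL (role, password, and database names are placeholders, not my actual values); the final step pairs this with a pg_hba.conf entry along the lines of `host all all 0.0.0.0/0 md5`:

```sql
-- Create a login role limited to CRUD on the app's database:
CREATE ROLE api_user WITH LOGIN PASSWORD 'secret';
GRANT CONNECT ON DATABASE atelier TO api_user;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO api_user;
```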

Server Deployment - Postgres - Performance and Stress Testing - Initial

Loader IO was used.

All getProductReview tests completed predictably, even though this API route usually takes longer. Meanwhile, the getReview route had issues at 100 rps & 1000 rps. Some tests behaved strangely halfway through the run; for example at 100 rps, the RPS increased to double what was specified and caused the system to stall out. Making a new identical test would exhibit different behavior, such as low response rates initially, then a spike, then a crash. At 1000 rps, getReview crashed while ramping up.

Initial - getReview

1 rps

LoaderIO_getReview_initial-1

10 rps

LoaderIO_getReview_initial-10

100 rps

LoaderIO_getReview_initial-100

1000 rps

LoaderIO_getReview_initial-1000

Initial - getProductReviews

1 rps

LoaderIO_getProductReviews_initial-1

10 rps

LoaderIO_getProductReviews_initial-10

100 rps

LoaderIO_getProductReviews_initial-100

1000 rps

LoaderIO_getProductReviews_initial-1000

SDC Final Interview with Michelle - Oct 20, 2021

The following key takeaway is a sample question that I could have answered better. I ultimately answered it, but indirectly and with some nudging needed. I need to catch on to the key point and not get distracted by the details.

We are a software company looking for someone to come in and help us rework our back end to better scale with our app. If we hired you for this, how would you go about helping us expand our system?

I would first make sure I understand the existing system and run various metrics-based tests on its current behavior. I should learn where it performs well, where the bottlenecks are, and what the system's limitations are and how they stack up against the company's demands for the system. Only after this is done should I begin thinking about how to improve their system.

SDC Final Interview with Leslie - Nov 3, 2021

Microservice vs. Database

These two terms have very different connotations to interviewers, and I should be careful which one I use where. Microservice indicates that I am working full stack and will have more limits to my database knowledge. Database indicates that I am looking to specialize on the back end and should be expected to know the ins & outs of databases to a much higher level.

For example, I had limits to what I could do/understand when optimizing databases via system configurations such as max connections and buffer allotment. This was due to it being a more difficult and indeterminate problem that must be calibrated to each OS & hardware setup. It requires deep knowledge of those that I do not have, nor am seeking out.

Putting Myself Down

I mentioned my failures on the project (failing to finish Mongo, not finishing cloud testing yet), but with the intent of explaining why I ended up in a different place than expected, or why some knowledge is partial. I figured it was better to be up front about it and give context to answers, but it still came out more negative than it could have. Leslie had some ideas for how I could achieve those ends better.

Mongo/Postgres

For Mongo/Postgres, my intent was to interact a lot more with Mongo for main traffic, but I never did because two databases were too much. So I had designed a system for two, implemented one, and the one I implemented, while more fundamental & important to get going first, was of secondary interest.

I could instead say that:

  1. Mongo was designed, but Postgres was implemented first for the reasons stated. No need to mention my intentions for Mongo.
  2. The Postgres implementation & optimization alone met the requirements, so there was no need to build out Mongo, but it is an option if needed in the future.

Cloud Testing

Just mention what I did, not what I didn't do. Again, bring up intentions as plans that could be fleshed out in the future; otherwise, say less and constructively say 'I don't know' if asked (see below). While context is handy, one problem with bringing up extra material is that interviewers tend to drill down (e.g. ask 'why' 3 times to get to the real answer), so bringing it up will likely invite further questioning.

When I Don't Know an Answer

Talk. More. Slowly. To. Buy. Time... ;-P

Also, apparently when I reason out loud, it sounds too confident, as if I know the answer (or think I do), and if it is wrong, it comes across as bullshitting. A different approach would be to say things like 'that is an interesting problem, let me think about it...' or 'how about...', and reason through the more basic setup steps as well, similar to whiteboarding.

Some Technical Strategies/Answers to Remember

Caching Layer Update Strategies

Whether for Redis, or for using Mongo as a caching database, these methods would be handled in a similar way.

Reads are rather straightforward: Store the data in the caching layer with an associated key (e.g. API request), and for every read, check for the existence of this key first and use the cached data if it exists.

CUD is trickier.

I had already worked out that for an update, you apply it to Postgres first and, if it is successful, update the caches on the return. If changing a single piece of data, this is straightforward enough, but what if I change a user name, which is associated with many reviews? In this case, there is much more to update (or invalidate for a fresh cache) on the return.

I had already reasoned through having an active async process updating cached DBs to Postgres (or a primary DB that is cloned for load-balanced servers). For our app, I reasoned, data accuracy was OK to lag a bit in the areas that change often. This same strategy could be used for updating caches more directly from CUD operations. The user already has the (theoretically) updated state and doesn't need to receive anything back, yet the return response sits on the critical path.

This can be handled by removing the cache update from the critical path. The flags for updating caches merely trigger an async update operation, and the request continues on back to the client. The update operation can do the cache updates on a different system on the server (or a different server) to avoid bogging down the allotted system resources.
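A hedged, Express-flavored sketch of that flow (the route and helper names are hypothetical):

```js
app.put('/users/:id/name', async (req, res) => {
  // 1. Update the source of truth first.
  await updateUserNameInPostgres(req.params.id, req.body.name);

  // 2. Return to the client immediately; they already hold the new state.
  res.sendStatus(204);

  // 3. Invalidate every cached review keyed to this user asynchronously,
  //    outside the request/response cycle.
  setImmediate(() => invalidateCachedReviewsForUser(req.params.id));
});
```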

Round Robin vs. Weighted Load Balancer Strategies

If you have a load balancer handling 4 identical servers, go Round Robin, as there is no penalty for switching between servers. However, if one server handled by the load balancer is itself a load balancer for other servers, a weighted strategy favoring the child balancer should be used. This more equally taps into ALL servers that can handle API requests, since that server may then use Round Robin on its own child servers.
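A hedged sketch in nginx terms (assuming nginx as the front balancer; hostnames are placeholders): if the child balancer fronts three app servers, weighting it 3:1 against a single app server spreads requests evenly across all four.

```
upstream api {
  server app1.internal:3000;              # a single app server
  server child-lb.internal:80 weight=3;   # itself balances 3 app servers
}
```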
