System Design & ML System Design

Cramming

Interests

  • Search Content Ranking, Search Learning and Evaluation, or Applied ML Platform.

Back of the Envelope (given task memory and task time)

  • Typically, requests are IO bound, not CPU bound, e.g., 64GB RAM / 40MB task memory / 100ms task time = 16K RPS (see the sketch after this list).
  • CPU bound: 40 cores on EC2 x 10 tasks per core per second = 400 RPS. 4K RPS seems easy; 8K RPS is quite good.
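
A quick sketch of the arithmetic above (the numbers are the ones from the bullets, purely illustrative):

```python
# Rough throughput arithmetic from the bullets above (illustrative only).

# IO bound: the number of in-flight tasks is limited by RAM, not cores.
ram_bytes = 64 * 1024**3                       # 64 GB of RAM per host
task_mem = 40 * 1024**2                        # ~40 MB held per in-flight task
task_time_s = 0.100                            # 100 ms per task
concurrent_tasks = ram_bytes // task_mem       # ~1,600 tasks in flight
io_bound_rps = concurrent_tasks / task_time_s  # ~16K requests/sec

# CPU bound: each core completes ~10 tasks per second.
cores = 40
tasks_per_core_per_s = 10
cpu_bound_rps = cores * tasks_per_core_per_s   # ~400 requests/sec

print(f"IO bound ~{io_bound_rps:,.0f} RPS, CPU bound ~{cpu_bound_rps} RPS")
```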

How we've scaled Dropbox http://youtu.be/PE4gwstWhmc

CAP theorem -- it is impossible for a distributed data store to simultaneously provide more than 2 out of the 3 guarantees.

  • C: Every read receives the most recent write or an error.
  • A: Every request receives a (non-error) response – w/o guarantee that it contains the most recent write.
  • P: The system continues to operate despite an arbitrary no. of messages being dropped (or delayed) by the network between nodes.

Partitioning, Replication, and Sentinel

  • partitioning is splitting data into multiple instances, so each instance only contains a subset of keys.
  • two main goals: allowing much larger data sets beyond the memory limit of a single computer, and scaling IO and computational power across multiple cores, computers, and network adapters.
  • alternatives
    • range partitioning (user id 0 to 10,000 goes to instance 0).
    • hash partitioning (a hashcode modulo # of instances), or consistent hashing to avoid remapping most keys on resizing (see the sketch after this list).
    • impl.: client side, proxy assisted, query routing (ask a random instance), or a hybrid (redirection).
  • disadvantages: set intersections across instances, transactions over multiple keys, a big sorted set bound to a single key, operational complexity, and adding or removing capacity.
  • re-balancing: pre-sharding into 2^5 instances w/ replication allows moving shards with minimal or no downtime.
  • Keep the slave updated w/ a stream of commands while connected; do a partial sync after a brief disconnection, or a full sync otherwise.
  • Full synchronization transfers the database file after a background saving process, and then all buffered commands.
  • Asynchronous replication can be configured so writes require a minimum # of attached replicas w/ a lag no greater than X seconds.
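
A minimal sketch of the hash-partitioning alternatives above (node names and the vnode count are made up for illustration): plain modulo routing remaps most keys when the instance count changes, while a consistent-hashing ring only remaps the keys adjacent to an added or removed node.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash so every client routes a given key to the same instance.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def hash_partition(key: str, num_instances: int) -> int:
    """Hashcode modulo # of instances: simple, but resizing remaps most keys."""
    return _hash(key) % num_instances

class ConsistentHashRing:
    """Adding/removing a node only remaps keys between it and its ring neighbor."""
    def __init__(self, nodes, vnodes=32):
        points = sorted((_hash(f"{node}#{i}"), node)
                        for node in nodes for i in range(vnodes))
        self._hashes = [h for h, _ in points]   # virtual nodes smooth the load
        self._nodes = [n for _, n in points]

    def route(self, key: str) -> str:
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._hashes)
        return self._nodes[idx]

ring = ConsistentHashRing(["redis-0", "redis-1", "redis-2"])
print(hash_partition("user:123", 3), ring.route("user:123"))
```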

Sentinel of gossip and agreement protocol

  • If a given Sentinel receives reports from enough Sentinels within a given time range that the master is not working, SDOWN is promoted to ODOWN. Even after the failover is triggered, at least a majority of Sentinels must authorize that Sentinel before the failover is actually performed.
  • DO acquire a lock on a single instance: SET resource1 uuid1 NX EX 30 # the lock's validity expires in 30s (see the sketch after this list).
  • DO acquire locks on a majority of instances sequentially; the validity time is reduced by the elapsed time spent acquiring them.
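
A minimal sketch of the single-instance lock above, assuming the redis-py client and a Redis instance on localhost; the majority-of-instances variant repeats this across instances and subtracts the elapsed time from the validity.

```python
import uuid
import redis  # assumes the redis-py client and a local Redis instance

r = redis.Redis()

def acquire_lock(resource, ttl_s=30):
    """SET resource <uuid> NX EX ttl -- succeeds only if the key doesn't exist."""
    token = str(uuid.uuid4())
    if r.set(resource, token, nx=True, ex=ttl_s):
        return token                 # the caller owns the lock until the TTL expires
    return None

def release_lock(resource, token):
    # Only release a lock we still own; in production this check-and-delete
    # should be a Lua script so it runs atomically.
    if r.get(resource) == token.encode():
        r.delete(resource)

token = acquire_lock("resource1")
if token:
    try:
        ...                          # critical section
    finally:
        release_lock("resource1", token)
```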

Consensus Problem - Raft

  • Symmetric (leader-less) and asymmetric (leader-based) approaches to consensus.
  • Using a leader makes normal operations simple and more efficient w/o conflicts.

ML Performance Improvement

  • Reframe/change the type of prediction problem: regression, classification, time series, anomalies, recommendations, ...
  • Data: Get & create more data, clean & resample data, rescale, transform & project data, select & engineer features.
  • Algorithms: K-fold cross validation, evaluation metrics, bias & variance, extending classic methods, and standard configs.
  • Tuning: diagnostics, intuitions, literature, random & grid search, optimize, ... (see the sketch after this list).
  • Ensembles: bagging, boosting, and stacking, ...
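
A minimal sketch of the K-fold cross validation and random-search bullets above, using scikit-learn on a built-in toy dataset (the model and parameter ranges are arbitrary):

```python
# K-fold CV for an honest score estimate, then random search over hyperparameters.
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# K-fold cross validation gives a spread, not just a single point estimate.
base = RandomForestClassifier(random_state=42)
scores = cross_val_score(base, X, y, cv=5, scoring="f1")
print("5-fold F1: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Random search samples a fixed budget of combinations instead of the full grid.
search = RandomizedSearchCV(
    base,
    param_distributions={"n_estimators": randint(50, 300),
                         "max_depth": randint(2, 12)},
    n_iter=20, cv=5, scoring="f1", random_state=42)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```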

ML ANN System

  • activation fn: Leaky ReLU, ReLU (outputs 0 for negative inputs), hyperbolic tangent (-1 to 1), logistic (binary class), softmax (mutually exclusive classes).
    • ELU: average output closer to 0 thanks to negative values, non-zero derivatives for negative inputs, and smooth w/o bouncing around 0.
  • LTU (linear threshold unit) computes a weighted sum of input & bias neurons, applies a step fn, and outputs the class.
  • softmax output layer replaces individual activation functions with a shared softmax function, when classes are exclusive.
  • forward-mode autodiff traverses the graph once per input variable -- reverse-mode autodiff traverses it only twice (forward, then backward), regardless of the # of variables.
  • hyperparameters for NN: # of hidden layers, # of neurons on each layer, and activation fn in hidden and output layers.
  • default DNN conf: He init., ELU activation, batch normalization, dropout, Adam optimizer, no learning rate schedule (see the sketch after this list).
  • batch normalization estimates the mean & SD of the inputs over each mini-batch, then shifts/scales them before the activation fn.
  • Feed Forward(FF), Recurrent (RNN), Long/short term memory (LSTM), Gated recur. unit (GRU), Auto encoder (AE).
  • RNN: seq2seq (time series/stock prices), seq2vec (a movie review to a sentiment score), vec2seq (an image to a caption).
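
A minimal sketch of the "default DNN conf" bullet above, using tf.keras; the layer sizes and the 784-dimensional input are placeholders, not from the notes.

```python
# He init + ELU + batch normalization + dropout + Adam, no learning-rate schedule.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Dense(300, kernel_initializer="he_normal", input_shape=(784,)),
    layers.BatchNormalization(),                  # normalize per mini-batch before activation
    layers.Activation("elu"),
    layers.Dropout(0.2),
    layers.Dense(100, kernel_initializer="he_normal"),
    layers.BatchNormalization(),
    layers.Activation("elu"),
    layers.Dropout(0.2),
    layers.Dense(10, activation="softmax"),       # softmax output for mutually exclusive classes
])

model.compile(optimizer=tf.keras.optimizers.Adam(),  # fixed learning rate, no schedule
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```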

ML system

  • overfitting -- TODO: simplify or constrain the model, gather more data, or reduce noise (errors and outliers).
  • underfitting -- TODO: power up or un-constrain models, or feed better features.
  • bias and variance tradeoff over model complexity: a model's generalization error decomposes into bias (a wrong assumption such as a linear model over quadratic data), variance (too sensitive to training data), and irreducible error (noise in the data).
  • sampling bias (non-representative test data) happens, when the sample is too small or the sampling method is flawed.
  • grid search (all combos) vs. random search, which samples a random value for each hyperparameter at every iteration for a fixed # of iterations.
  • precision and recall, F1 score, and ROC AUC (0.5 to 1.0).
    • DO prefer PR curve to ROC, whenever the positive class is rare or when you care more about the FP than the FN.
  • RMSE (L2) is more sensitive to outliers than MAE (L1) -- the higher the norm index the more it focuses on large values.
  • stratified sampling prevents sampling bias from purely random sampling, when the dataset is not large enough.
  • SGD, mini-batch, or batch GD for a training set w/ millions of features; the normal equation becomes computationally prohibitive.
  • Regularized models (Lasso L1, Ridge L2, and Elastic Net mix) are sensitive to the scale of input features.
  • Softmax regr' is a generalization of logistic regr' to support multiclass, where the cost function is cross entropy.
  • linear vs. non-linear SVM (kernelized) for large margin classification -- trains in roughly O(m x n) vs. O(m**2 x n) to O(m**3 x n) time.
  • decision trees are white box models, as decisions are easy to interpret -- fairly intuitive; they can even be applied manually.
  • soft voting often achieves higher performance than hard voting, b/c it gives more weight to highly confident votes (see the sketch after this list).
  • bagging allows training instances to be sampled several times across predictors as well as for the same predictor.
  • AdaBoost trains models w/ increasing weights on hard training instances, and takes an accuracy-weighted vote.
  • gradient boosting fits models to residual errors from predecessors instead of tweaking training instance weights.
  • stacking trains the final predictor (a blender or meta learner) to aggregate the predictions of all input predictors.
  • https://www.springboard.com/blog/machine-learning-interview-questions/
  • https://www.analyticsvidhya.com/blog/2016/09/40-interview-questions-asked-at-startups-in-machine-learning-data-science/
  • https://elitedatascience.com/machine-learning-interview-questions-answers
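
A minimal sketch of the soft- vs. hard-voting bullet above, using scikit-learn's VotingClassifier on a toy dataset (the estimators are chosen arbitrarily):

```python
# Compare hard voting (majority of predicted labels) with soft voting (averaged probabilities).
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=1000, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

estimators = [("lr", LogisticRegression()),
              ("rf", RandomForestClassifier(random_state=42)),
              ("svc", SVC(probability=True, random_state=42))]  # soft voting needs predict_proba

for voting in ("hard", "soft"):
    clf = VotingClassifier(estimators=estimators, voting=voting)
    clf.fit(X_train, y_train)
    # Soft voting averages class probabilities, so highly confident votes weigh more.
    print(voting, clf.score(X_test, y_test))
```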

ML Cheat
