- http://nbviewer.jupyter.org/github/henry4j/-/blob/master/man/episode1-3.ipynb
- http://nbviewer.jupyter.org/github/henry4j/-/blob/master/man/episode4-7.ipynb
- http://nbviewer.jupyter.org/github/henry4j/-/blob/master/man/episode9-11.ipynb
- http://nbviewer.jupyter.org/github/henry4j/-/blob/master/man/episode17-18.ipynb
- http://nbviewer.jupyter.org/github/henry4j/-/blob/master/man/episodeDP.ipynb
- http://nbviewer.jupyter.org/github/henry4j/-/blob/master/man/episodeA-Z.ipynb
- http://lethain.com/introduction-to-architecting-systems-for-scale/
- Search Content Ranking, Search Learning and Evaluation, or Applied ML Platform.
Back of the envelope, given task memory and task time
- Typically, requests are IO bound, not CPU bound, e.g., 64 GB RAM / 40 MB task memory = 1,600 concurrent tasks; at 100 ms per task, ~16K RPS.
- CPU bound: 40 cores on EC2 x task frequency 10/sec = 400 RPS. 4K RPS seems easy; 8K RPS seems quite good.
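A quick sketch of the arithmetic behind the two estimates above (the numbers are the illustrative ones from these notes, not measurements):

```python
# IO bound: how many tasks fit in memory, and how fast they turn over.
ram_mb, task_mb, task_sec = 64_000, 40, 0.100   # 64 GB RAM, 40 MB and 100 ms per task
concurrent = ram_mb // task_mb                  # 1,600 tasks resident at once
print(concurrent / task_sec)                    # ~16,000 RPS when IO bound

# CPU bound: cores times tasks each core completes per second.
cores, tasks_per_core_sec = 40, 10
print(cores * tasks_per_core_sec)               # 400 RPS when CPU bound
```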
How we've scaled Dropbox http://youtu.be/PE4gwstWhmc
CAP theorem -- it is impossible for a distributed data store to simultaneously provide more than 2 out of the 3 guarantees.
- C: Every read receives the most recent write or an error.
- A: Every request receives a (non-error) response – w/o guarantee that it contains the most recent write.
- P: The system continues to operate despite an arbitrary no. of messages being dropped (or delayed) by the network between nodes.
- partitioning is splitting data across multiple instances, so each instance contains only a subset of the keys.
- two main goals: allowing data sets much larger than a single computer's memory, and scaling IO and computational power across multiple cores, computers, and network adapters.
- alternatives
- range partitioning (user id 0 to 10,000 goes to instance 0).
- hash partitioning (a hashcode modulo the # of instances): consistent hashing? (see the sketch after this list)
- impl.: client-side, proxy-assisted, query routing (ask a random instance), or a hybrid (redirect).
- disadvantages: intersections between sets, transactions over multiple keys, a big sorted set under a single key, operational complexity, and adding or removing capacity all become harder.
- re-balancing, or pre-sharding into 2^5 instances w/ replication, can move data with minimal or no downtime.
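A minimal sketch of the two schemes above (instance names are hypothetical): plain modulo hashing remaps most keys whenever the instance count changes, which is exactly what consistent hashing avoids.

```python
import bisect, hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def mod_partition(key: str, n_instances: int) -> int:
    # hash partitioning: a hashcode modulo the # of instances
    return h(key) % n_instances

class Ring:
    """Consistent hashing: instances (with virtual nodes) sit on a ring; a key
    maps to the first instance clockwise, so resizing only remaps nearby keys."""
    def __init__(self, instances, vnodes=64):
        points = sorted((h(f"{name}#{v}"), name) for name in instances for v in range(vnodes))
        self._hashes = [p[0] for p in points]
        self._names = [p[1] for p in points]

    def lookup(self, key: str) -> str:
        i = bisect.bisect(self._hashes, h(key)) % len(self._hashes)
        return self._names[i]

ring = Ring(["redis-0", "redis-1", "redis-2"])
print(mod_partition("user:42", 3), ring.lookup("user:42"))
```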
- Keep the slave updated w/ a stream of commands while connected; on disconnection, do a partial sync, or else a full sync.
- Full synchronization transfers the database file after a background saving process, and then all buffered commands.
- Asynchronous replication can be configured to require a minimum # of attached replicas whose lag is no greater than X seconds.
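As a hedged, concrete example, Redis exposes this policy through the min-replicas-to-write / min-replicas-max-lag directives; a redis-py sketch with illustrative thresholds:

```python
import redis

r = redis.Redis(host="localhost", port=6379)
# Illustrative values: refuse writes on the master unless at least 2 replicas
# are attached and their replication lag is no more than 10 seconds.
r.config_set("min-replicas-to-write", 2)
r.config_set("min-replicas-max-lag", 10)
```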
Sentinel: gossip and agreement protocols
- if a given Sentinel gets reports from enough Sentinels within a given time range that the master is not working, its SDOWN is promoted to ODOWN. However, once the failover is triggered, at least a majority of Sentinels must still authorize that Sentinel for the failover to actually be performed.
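A toy sketch of the SDOWN-to-ODOWN promotion described above, not Sentinel's actual implementation: count recent "master down" reports from peer Sentinels and promote only when the quorum is reached within the time window.

```python
import time

class SentinelView:
    def __init__(self, quorum: int, window_sec: float = 10.0):
        self.quorum, self.window = quorum, window_sec
        self.reports = {}                        # peer sentinel id -> time of its last SDOWN report

    def report_sdown(self, peer_id: str) -> None:
        self.reports[peer_id] = time.time()

    def is_odown(self) -> bool:
        now = time.time()
        recent = [p for p, t in self.reports.items() if now - t <= self.window]
        return len(recent) + 1 >= self.quorum    # +1 for this sentinel's own SDOWN view

view = SentinelView(quorum=2)
view.report_sdown("sentinel-b")
print(view.is_odown())   # True: own view plus one recent peer report reaches quorum 2
```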
- DO acquire a lock on a single instance:
  SET resource1 uuid1 NX EX 30  # this resource lock's validity expires in 30s
- DO acquire locks on a majority of instances sequentially; the remaining validity time is the lock TTL minus the elapsed acquisition time.
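A hedged redis-py sketch of the single-instance lock: SET with NX and EX to acquire, and a compare-and-delete Lua script to release, so a client never deletes a lock it no longer holds (key names and TTL are illustrative).

```python
import uuid
import redis

r = redis.Redis()

# Atomically delete the lock only if it still holds our token.
RELEASE = r.register_script(
    "if redis.call('get', KEYS[1]) == ARGV[1] then "
    "return redis.call('del', KEYS[1]) else return 0 end"
)

def acquire(resource: str, ttl_sec: int = 30):
    token = str(uuid.uuid4())
    # SET resource token NX EX ttl -- succeeds only if the key does not exist.
    return token if r.set(resource, token, nx=True, ex=ttl_sec) else None

def release(resource: str, token: str) -> bool:
    return bool(RELEASE(keys=[resource], args=[token]))
```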
Consensus Problem - Raft
- Symmetric and Asymmetric approaches to consensus (leader-less or -based).
- Using a leader makes normal operations simple and more efficient w/o conflicts.
- Reframe/change the type of prediction problem: regression, classification, time series, anomalies, recommendations, ...
- Data: Get & create more data, clean & resample data, rescale, transform & project data, select & engineer features.
- Algorithms: K-fold cross validation, evaluation metrics, bias & variance, extending classic methods, and standard configs.
- Tuning: diagnostics, intuitions, literature, random & grid search, optimize, ...
- Ensembles: bagging, boosting, and stacking, ...
- activation fn: Leaky ReLU, ReLU (outputs 0 for negative inputs), hyperbolic tangent (-1 to 1), logistic (binary class), softmax (mutually exclusive classes) -- sketched after this group.
- ELU: average output closer to 0 thanks to negative values, always non-zero derivatives, and smooth around z = 0, so gradient descent doesn't bounce.
- LTU (linear threshold unit) computes a weighted sum of input & bias neurons, applies a step fn, and outputs the class.
- softmax output layer replaces individual activation functions with a shared softmax function, when classes are exclusive.
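A minimal NumPy sketch of the activation functions listed above (softmax shown over a vector of class scores):

```python
import numpy as np

def relu(z):               return np.maximum(0.0, z)                    # 0 for negative inputs
def leaky_relu(z, a=0.01): return np.where(z > 0, z, a * z)             # small slope for negatives
def elu(z, a=1.0):         return np.where(z > 0, z, a * (np.exp(z) - 1))  # negative outputs pull the mean toward 0
def tanh(z):               return np.tanh(z)                            # squashes to (-1, 1)
def sigmoid(z):            return 1.0 / (1.0 + np.exp(-z))              # logistic, for binary classes

def softmax(scores):                                                    # mutually exclusive classes
    e = np.exp(scores - scores.max())                                   # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))                               # probabilities summing to 1
```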
- forward-mode autodiff traverses the graph once per variable -- reverse-mode autodiff traverses it only twice, regardless of the # of variables.
- hyperparameters for NN: # of hidden layers, # of neurons on each layer, and activation fn in hidden and output layers.
- default DNN conf: He init., ELU activation, batch normalization, dropout, Adam optimizer, no learning rate schedule.
- batch normalization estimates the mean & SD of the inputs over each mini-batch, then shifts/scales them before the activation fn (see the sketch below).
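A sketch of the batch-normalization forward pass for one mini-batch at training time; gamma (scale) and beta (shift) stand in for the learned parameters:

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    # X: (batch_size, n_features) inputs to a layer, before the activation fn.
    mu = X.mean(axis=0)                      # per-feature mean over the mini-batch
    var = X.var(axis=0)                      # per-feature variance over the mini-batch
    X_hat = (X - mu) / np.sqrt(var + eps)    # normalize
    return gamma * X_hat + beta              # then scale and shift

X = np.random.randn(32, 4) * 3 + 7
out = batch_norm(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```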
- Feed forward (FF), recurrent (RNN), long short-term memory (LSTM), gated recurrent unit (GRU), autoencoder (AE).
- RNN: seq2seq (time series/stock prices), seq2vec (a movie review to a sentiment score), vec2seq (an image to a caption).
- overfitting -- TODO: simplify or constrain the model, gather more data, or reduce noise (errors and outliers).
- underfitting -- TODO: power up or un-constrain models, or feed better features.
- bias and variance tradeoff over model complexity: a model's generalization error decomposes into bias (a wrong assumption such as a linear model over quadratic data), variance (too sensitive to training data), and irreducible error (noise in the data).
- sampling bias (non-representative test data) happens, when the sample is too small or the sampling method is flawed.
- grid search (all combos) vs. random search, which tries a fixed # of combinations, drawing a random value for each hyperparameter at every iteration.
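A hedged scikit-learn sketch contrasting the two: GridSearchCV tries every combination, while RandomizedSearchCV samples n_iter random combinations (the dataset and model here are only for illustration).

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42)

# Grid search: all 3 x 3 = 9 combinations.
grid = GridSearchCV(model, {"n_estimators": [10, 50, 100], "max_depth": [2, 4, 8]}, cv=3)
grid.fit(X, y)

# Random search: 9 random draws from distributions over the same hyperparameters.
rand = RandomizedSearchCV(
    model, {"n_estimators": randint(10, 200), "max_depth": randint(2, 10)},
    n_iter=9, cv=3, random_state=42)
rand.fit(X, y)
print(grid.best_params_, rand.best_params_)
```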
- precision and recall, F1 score, and ROC AUC (0.5 to 1.0).
- DO prefer PR curve to ROC, whenever the positive class is rare or when you care more about the FP than the FN.
- RMSE (L2) is more sensitive to outliers than MAE (L1) -- the higher the norm index the more it focuses on large values.
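A small numeric illustration of that sensitivity (the error values are made up):

```python
import numpy as np

def rmse(err): return np.sqrt(np.mean(np.square(err)))   # L2-style norm
def mae(err):  return np.mean(np.abs(err))                # L1-style norm

clean = np.array([1.0, -1.0, 2.0, -2.0])
with_outlier = np.array([1.0, -1.0, 2.0, -20.0])

print(rmse(clean), mae(clean))                # ~1.58 vs 1.5 -- similar
print(rmse(with_outlier), mae(with_outlier))  # ~10.1 vs 6.0 -- RMSE blows up on the outlier
```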
- stratified sampling prevents sampling bias from purely random sampling, when the dataset is not large enough.
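A scikit-learn sketch of a stratified split vs. a purely random one, on a made-up imbalanced label:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)            # 10% positive class
X = np.arange(100).reshape(-1, 1)

# Purely random split: the test set's positive rate can drift from 10%.
_, _, _, y_rand = train_test_split(X, y, test_size=0.2, random_state=0)
# Stratified split: each split keeps roughly the original class proportions.
_, _, _, y_strat = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

print(y_rand.mean(), y_strat.mean())         # stratified stays at ~0.10
```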
- w/ a training set of millions of features, use SGD, mini-batch, or batch GD; the normal equation becomes computationally prohibitive.
- Regularized models (Lasso L1, Ridge L2, and Elastic Net mix) are sensitive to the scale of input features.
- Softmax regr' is a generalization of logistic regr' to support multiclass, where the cost function is cross entropy.
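A NumPy sketch of the cross-entropy cost over softmax probabilities (scores and labels are illustrative):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(scores, y):
    # y holds the true class index per instance; cost is -mean(log p_true_class).
    p = softmax(scores)
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

scores = np.array([[2.0, 0.5, 0.1], [0.2, 1.8, 0.3]])   # 2 instances, 3 classes
print(cross_entropy(scores, np.array([0, 1])))           # small: both true classes score highest
```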
- linear vs. non-linear SVM (kernelized) for large margin classification -- trains in O(m x n) vs. between O(m**2 x n) and O(m**3 x n) time.
- decision trees are white box models, as decisions are easy to interpret -- fairly intuitive; can even be applied manually.
- soft voting often achieves higher performance than hard voting, b/c it gives more weight to highly confident votes (see the sketch after this group).
- bagging allows training instances to be sampled several times across predictors as well as for the same predictor.
- AdaBoost trains predictors sequentially, increasing the weights of hard training instances, and takes an accuracy-weighted vote.
- gradient boosting fits models to residual errors from predecessors instead of tweaking training instance weights.
- stacking trains the final predictor (a blender or meta learner) to aggregate the predictions of all the underlying predictors.
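A hedged scikit-learn sketch of soft vs. hard voting from the notes above (dataset and base models chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=42)
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=42)),
    ("svc", SVC(probability=True, random_state=42)),  # probability=True so soft voting can average class probabilities
]

hard = VotingClassifier(estimators, voting="hard")    # majority of predicted labels
soft = VotingClassifier(estimators, voting="soft")    # averages predicted probabilities, so confident votes weigh more
print(cross_val_score(hard, X, y, cv=5).mean(), cross_val_score(soft, X, y, cv=5).mean())
```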
- https://www.springboard.com/blog/machine-learning-interview-questions/
- https://www.analyticsvidhya.com/blog/2016/09/40-interview-questions-asked-at-startups-in-machine-learning-data-science/
- https://elitedatascience.com/machine-learning-interview-questions-answers
- http://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/#prettyPhoto
- http://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/
- http://unsupervisedmethods.com/cheat-sheet-of-machine-learning-and-python-and-math-cheat-sheets-a4afe4e791b6