- http://nbviewer.jupyter.org/github/henry4j/-/blob/master/man/episode1-3.ipynb
- http://nbviewer.jupyter.org/github/henry4j/-/blob/master/man/episode4-7.ipynb
- http://nbviewer.jupyter.org/github/henry4j/-/blob/master/man/episode9-11.ipynb
- http://nbviewer.jupyter.org/github/henry4j/-/blob/master/man/episode17-18.ipynb
- http://nbviewer.jupyter.org/github/henry4j/-/blob/master/man/episodeDP.ipynb
- http://nbviewer.jupyter.org/github/henry4j/-/blob/master/man/episodeA-Z.ipynb
- http://lethain.com/introduction-to-architecting-systems-for-scale/
- Search Content Ranking, Search Learning and Evaluation, or Applied ML Platform.
Back of the envelope, given task memory and task time
- Typically, requests are IO bound, not CPU bound, e.g., 64 GB RAM / 40 MB task memory = 1,600 concurrent tasks; at 100 ms per task, ~16K RPS.
- CPU bound: 40 cores on EC2 x task frequency 10/sec = 400 RPS. 4K RPS seems easy; 8K RPS seems quite good.
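A quick sketch of the arithmetic behind the two estimates above (the numbers are the illustrative ones from these notes, not measurements):

```python
# IO bound: how many tasks fit in memory, and how fast they turn over.
ram_mb, task_mb, task_sec = 64_000, 40, 0.100   # 64 GB RAM, 40 MB and 100 ms per task
concurrent = ram_mb // task_mb                  # 1,600 tasks resident at once
print(concurrent / task_sec)                    # ~16,000 RPS when IO bound

# CPU bound: cores times tasks each core completes per second.
cores, tasks_per_core_sec = 40, 10
print(cores * tasks_per_core_sec)               # 400 RPS when CPU bound
```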
How we've scaled Dropbox http://youtu.be/PE4gwstWhmc
CAP theorem -- it is impossible for a distributed data store to simultaneously provide more than 2 out of the 3 guarantees.
- C: Every read receives the most recent write or an error.
- A: Every request receives a (non-error) response – w/o guarantee that it contains the most recent write.
- P: The system continues to operate despite an arbitrary no. of messages being dropped (or delayed) by the network between nodes.
- partitioning is splitting data across multiple instances, so each instance contains only a subset of the keys.
- two main goals: allowing data sets much larger than a single computer's memory, and scaling IO and computational power across multiple cores, computers, and network adapters.
- alternatives
- range partitioning (user id 0 to 10,000 goes to instance 0).
- hash partitioning (a hashcode modulo the # of instances): consistent hashing? (see the sketch after this list)
- impl.: client-side, proxy-assisted, query routing (ask a random instance), or a hybrid (redirect).
- disadvantages: intersections between sets, transactions over multiple keys, a big sorted set under a single key, operational complexity, and adding or removing capacity all become harder.
- re-balancing, or pre-sharding into 2^5 instances w/ replication, can move data with minimal or no downtime.
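A minimal sketch of the two schemes above (instance names are hypothetical): plain modulo hashing remaps most keys whenever the instance count changes, which is exactly what consistent hashing avoids.

```python
import bisect, hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def mod_partition(key: str, n_instances: int) -> int:
    # hash partitioning: a hashcode modulo the # of instances
    return h(key) % n_instances

class Ring:
    """Consistent hashing: instances (with virtual nodes) sit on a ring; a key
    maps to the first instance clockwise, so resizing only remaps nearby keys."""
    def __init__(self, instances, vnodes=64):
        points = sorted((h(f"{name}#{v}"), name) for name in instances for v in range(vnodes))
        self._hashes = [p[0] for p in points]
        self._names = [p[1] for p in points]

    def lookup(self, key: str) -> str:
        i = bisect.bisect(self._hashes, h(key)) % len(self._hashes)
        return self._names[i]

ring = Ring(["redis-0", "redis-1", "redis-2"])
print(mod_partition("user:42", 3), ring.lookup("user:42"))
```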
- Keep the slave updated w/ a stream of commands while connected; on disconnection, do a partial sync, or else a full sync.
- Full synchronization transfers the database file after a background saving process, and then all buffered commands.
- Asynchronous replication can be configured to require a minimum # of attached replicas whose lag is no greater than X seconds.
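As a hedged, concrete example, Redis exposes this policy through the min-replicas-to-write / min-replicas-max-lag directives; a redis-py sketch with illustrative thresholds:

```python
import redis

r = redis.Redis(host="localhost", port=6379)
# Illustrative values: refuse writes on the master unless at least 2 replicas
# are attached and their replication lag is no more than 10 seconds.
r.config_set("min-replicas-to-write", 2)
r.config_set("min-replicas-max-lag", 10)
```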
Sentinel: gossip and agreement protocols
- if a given Sentinel gets reports from enough Sentinels within a given time range that the master is not working, its SDOWN is promoted to ODOWN. However, once the failover is triggered, at least a majority of Sentinels must still authorize that Sentinel for the failover to actually be performed.
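A toy sketch of the SDOWN-to-ODOWN promotion described above, not Sentinel's actual implementation: count recent "master down" reports from peer Sentinels and promote only when the quorum is reached within the time window.

```python
import time

class SentinelView:
    def __init__(self, quorum: int, window_sec: float = 10.0):
        self.quorum, self.window = quorum, window_sec
        self.reports = {}                        # peer sentinel id -> time of its last SDOWN report

    def report_sdown(self, peer_id: str) -> None:
        self.reports[peer_id] = time.time()

    def is_odown(self) -> bool:
        now = time.time()
        recent = [p for p, t in self.reports.items() if now - t <= self.window]
        return len(recent) + 1 >= self.quorum    # +1 for this sentinel's own SDOWN view

view = SentinelView(quorum=2)
view.report_sdown("sentinel-b")
print(view.is_odown())   # True: own view plus one recent peer report reaches quorum 2
```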
- DO acquire a lock on a single instance:
  SET resource1 uuid1 NX EX 30  # this resource lock's validity expires in 30s
- DO acquire locks on a majority of instances sequentially; the remaining validity time is the lock TTL minus the elapsed acquisition time.
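A hedged redis-py sketch of the single-instance lock: SET with NX and EX to acquire, and a compare-and-delete Lua script to release, so a client never deletes a lock it no longer holds (key names and TTL are illustrative).

```python
import uuid
import redis

r = redis.Redis()

# Atomically delete the lock only if it still holds our token.
RELEASE = r.register_script(
    "if redis.call('get', KEYS[1]) == ARGV[1] then "
    "return redis.call('del', KEYS[1]) else return 0 end"
)

def acquire(resource: str, ttl_sec: int = 30):
    token = str(uuid.uuid4())
    # SET resource token NX EX ttl -- succeeds only if the key does not exist.
    return token if r.set(resource, token, nx=True, ex=ttl_sec) else None

def release(resource: str, token: str) -> bool:
    return bool(RELEASE(keys=[resource], args=[token]))
```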
Consensus Problem - Raft
- Symmetric and Asymmetric approaches to consensus (leader-less or -based).
- Using a leader makes normal operations simple and more efficient w/o conflicts.
- Reframe/change the type of prediction problem: regression, classification, time series, anomalies, recommendations, ...
- Data: Get & create more data, clean & resample data, rescale, transform & project data, select & engineer features.
- Algorithms: K-fold cross validation, evaluation metrics, bias & variance, extending classic methods, and standard configs.
- Tuning: diagnostics, intuitions, literature, random & grid search, optimize, ...
- Ensembles: bagging, boosting, and stacking, ...
- activation fn: Leaky ReLU, ReLU (outputs 0 for negative inputs), hyperbolic tangent (-1 to 1), logistic (binary class), softmax (mutually exclusive classes) -- sketched after this group.
- ELU: average output closer to 0 thanks to negative values, always non-zero derivatives, and smooth around z = 0, so gradient descent doesn't bounce.
- LTU (linear threshold unit) computes a weighted sum of input & bias neurons, applies a step fn, and outputs the class.
- softmax output layer replaces individual activation functions with a shared softmax function, when classes are exclusive.
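A minimal NumPy sketch of the activation functions listed above (softmax shown over a vector of class scores):

```python
import numpy as np

def relu(z):               return np.maximum(0.0, z)                    # 0 for negative inputs
def leaky_relu(z, a=0.01): return np.where(z > 0, z, a * z)             # small slope for negatives
def elu(z, a=1.0):         return np.where(z > 0, z, a * (np.exp(z) - 1))  # negative outputs pull the mean toward 0
def tanh(z):               return np.tanh(z)                            # squashes to (-1, 1)
def sigmoid(z):            return 1.0 / (1.0 + np.exp(-z))              # logistic, for binary classes

def softmax(scores):                                                    # mutually exclusive classes
    e = np.exp(scores - scores.max())                                   # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))                               # probabilities summing to 1
```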
- forward-mode autodiff traverses the graph once per variable -- reverse-mode autodiff traverses it only twice, regardless of the # of variables.
- hyperparameters for NN: # of hidden layers, # of neurons on each layer, and activation fn in hidden and output layers.
- default DNN conf: He init., ELU activation, batch normalization, dropout, Adam optimizer, no learning rate schedule.
- batch normalization estimates the mean & SD of the inputs over each mini-batch, then shifts/scales them before the activation fn (see the sketch below).
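A sketch of the batch-normalization forward pass for one mini-batch at training time; gamma (scale) and beta (shift) stand in for the learned parameters:

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    # X: (batch_size, n_features) inputs to a layer, before the activation fn.
    mu = X.mean(axis=0)                      # per-feature mean over the mini-batch
    var = X.var(axis=0)                      # per-feature variance over the mini-batch
    X_hat = (X - mu) / np.sqrt(var + eps)    # normalize
    return gamma * X_hat + beta              # then scale and shift

X = np.random.randn(32, 4) * 3 + 7
out = batch_norm(X, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))  # ~0 and ~1 per feature
```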
- Feed forward (FF), recurrent (RNN), long short-term memory (LSTM), gated recurrent unit (GRU), autoencoder (AE).
- RNN: seq2seq (time series/stock prices), seq2vec (a movie review to a sentiment score), vec2seq (an image to a caption).
- overfitting -- TODO: simplify or constrain the model, gather more data, or reduce noise (errors and outliers).
- underfitting -- TODO: power up or un-constrain models, or feed better features.
- bias and variance tradeoff over model complexity: a model's generalization error decomposes into bias (a wrong assumption such as a linear model over quadratic data), variance (too sensitive to training data), and irreducible error (noise in the data).
- sampling bias (non-representative test data) happens, when the sample is too small or the sampling method is flawed.
- grid search (all combos) vs. random search, which tries a fixed # of combinations, drawing a random value for each hyperparameter at every iteration.
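A hedged scikit-learn sketch contrasting the two: GridSearchCV tries every combination, while RandomizedSearchCV samples n_iter random combinations (the dataset and model here are only for illustration).

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=42)

# Grid search: all 3 x 3 = 9 combinations.
grid = GridSearchCV(model, {"n_estimators": [10, 50, 100], "max_depth": [2, 4, 8]}, cv=3)
grid.fit(X, y)

# Random search: 9 random draws from distributions over the same hyperparameters.
rand = RandomizedSearchCV(
    model, {"n_estimators": randint(10, 200), "max_depth": randint(2, 10)},
    n_iter=9, cv=3, random_state=42)
rand.fit(X, y)
print(grid.best_params_, rand.best_params_)
```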
- precision and recall, F1 score, and ROC AUC (0.5 to 1.0).
- DO prefer PR curve to ROC, whenever the positive class is rare or when you care more about the FP than the FN.
- RMSE (L2) is more sensitive to outliers than MAE (L1) -- the higher the norm index the more it focuses on large values.
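A small numeric illustration of that sensitivity (the error values are made up):

```python
import numpy as np

def rmse(err): return np.sqrt(np.mean(np.square(err)))   # L2-style norm
def mae(err):  return np.mean(np.abs(err))                # L1-style norm

clean = np.array([1.0, -1.0, 2.0, -2.0])
with_outlier = np.array([1.0, -1.0, 2.0, -20.0])

print(rmse(clean), mae(clean))                # ~1.58 vs 1.5 -- similar
print(rmse(with_outlier), mae(with_outlier))  # ~10.1 vs 6.0 -- RMSE blows up on the outlier
```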
- stratified sampling prevents sampling bias from purely random sampling, when the dataset is not large enough.
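A scikit-learn sketch of a stratified split vs. a purely random one, on a made-up imbalanced label:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([0] * 90 + [1] * 10)            # 10% positive class
X = np.arange(100).reshape(-1, 1)

# Purely random split: the test set's positive rate can drift from 10%.
_, _, _, y_rand = train_test_split(X, y, test_size=0.2, random_state=0)
# Stratified split: each split keeps roughly the original class proportions.
_, _, _, y_strat = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

print(y_rand.mean(), y_strat.mean())         # stratified stays at ~0.10
```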
- w/ a training set of millions of features, use SGD, mini-batch, or batch GD; the normal equation becomes computationally prohibitive.
- Regularized models (Lasso L1, Ridge L2, and Elastic Net mix) are sensitive to the scale of input features.
- Softmax regr' is a generalization of logistic regr' to support multiclass, where the cost function is cross entropy.
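A NumPy sketch of the cross-entropy cost over softmax probabilities (scores and labels are illustrative):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(scores, y):
    # y holds the true class index per instance; cost is -mean(log p_true_class).
    p = softmax(scores)
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

scores = np.array([[2.0, 0.5, 0.1], [0.2, 1.8, 0.3]])   # 2 instances, 3 classes
print(cross_entropy(scores, np.array([0, 1])))           # small: both true classes score highest
```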
- linear vs. non-linear SVM (kernelized) for large margin classification -- trains in O(m x n) vs. between O(m**2 x n) and O(m**3 x n) time.
- decision trees are white box models, as decisions are easy to interpret -- fairly intuitive; can even be applied manually.
- soft voting often achieves higher performance than hard voting, b/c it gives more weight to highly confident votes (see the sketch after this group).
- bagging allows training instances to be sampled several times across predictors as well as for the same predictor.
- AdaBoost trains predictors sequentially, increasing the weights of hard training instances, and takes an accuracy-weighted vote.
- gradient boosting fits models to residual errors from predecessors instead of tweaking training instance weights.
- stacking trains the final predictor (a blender or meta learner) to aggregate the predictions of all the underlying predictors.
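A hedged scikit-learn sketch of soft vs. hard voting from the notes above (dataset and base models chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=42)
estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(random_state=42)),
    ("svc", SVC(probability=True, random_state=42)),  # probability=True so soft voting can average class probabilities
]

hard = VotingClassifier(estimators, voting="hard")    # majority of predicted labels
soft = VotingClassifier(estimators, voting="soft")    # averages predicted probabilities, so confident votes weigh more
print(cross_val_score(hard, X, y, cv=5).mean(), cross_val_score(soft, X, y, cv=5).mean())
```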
- https://www.springboard.com/blog/machine-learning-interview-questions/
- https://www.analyticsvidhya.com/blog/2016/09/40-interview-questions-asked-at-startups-in-machine-learning-data-science/
- https://elitedatascience.com/machine-learning-interview-questions-answers
- http://blogs.sas.com/content/subconsciousmusings/2017/04/12/machine-learning-algorithm-use/#prettyPhoto
- http://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/
- http://unsupervisedmethods.com/cheat-sheet-of-machine-learning-and-python-and-math-cheat-sheets-a4afe4e791b6