My study notes
#include <iostream>
#include <cmath>
using namespace std;

bool is_prime(int n) {
    if (n <= 1) return false;
    if (n <= 3) return true;
    if (n % 2 == 0 || n % 3 == 0) return false;
    for (int i = 5; i * i <= n; i += 6)
        if (n % i == 0 || n % (i + 2) == 0) return false;
    return true;
}

// n is a Fibonacci number iff 5n^2 + 4 or 5n^2 - 4 is a perfect square
bool is_perfect_square(long long x) {
    long long s = (long long)round(sqrt((double)x));
    return s * s == x;
}

bool is_fibo(int n) {
    long long sq = (long long)n * n;
    return is_perfect_square(5 * sq - 4) || is_perfect_square(5 * sq + 4);
}

// Closest integer to phi^n / sqrt(5) (Binet's formula)
int get_fibo(int n) {
    double phi = (1 + sqrt(5)) / 2;
    return (int)round(pow(phi, n) / sqrt(5));
}

// SAKAMOTO ALGORITHM: returns the day of the week (0 = Sunday, ..., 6 = Saturday)
int day_of_week(int year, int month, int day) {
    static int t[] = {0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4};
    year -= month < 3;
    return (year + year / 4 - year / 100 + year / 400 + t[month - 1] + day) % 7;
}
function getWebName(url) {
    // http://example1.com/a/b?c=d => example1
    // http://www.example2.com/b?c=d => example2
    // https://ww.example3.com.vn => example3
    const hostnameParts = new URL(url).hostname.split('.');
    return hostnameParts[hostnameParts.length - 1].length === 2
        ? hostnameParts[hostnameParts.length - 3] // country-code TLD like .vn => skip 2 suffix parts
        : hostnameParts[hostnameParts.length - 2];
}
// Check even and odd without `if else`
const number = 3;
["even", "odd"][number % 2]; // "odd"
// Get intersection
const a = new Set([1,2,3]);
const b = new Set([4,3,2]);
const intersection = [...a].filter(x => b.has(x))
console.log(intersection) // [2, 3]
function getCookieField(name) {
    const cookie = document.cookie.split("; ").find(item => item.startsWith(`${name}=`));
    return cookie ? decodeURIComponent(cookie.substring(name.length + 1)) : null; // keeps any "=" inside the value
}
(265 >>> 0).toString(2);
(_$=($,_=[]+[])=>$?_$($>>+!![],($&+!![])+_):_)(265);
/*
This is not a RegEx but an arrow function whose function name, variable names, and the number 1 are written with special characters; the number 1 is expressed as the array expression +!![].
Here is a slightly easier-to-read version of the code:
(toBinary = (val, str = "") => val ? toBinary(val >> 1, (val & 1) + str) : str)(265);
[]+[] is the empty string "".
+!![] is the number 1.
It uses recursion to take each bit and prepend it to the string str (initially the empty string ""). The stopping condition is val equal to 0 (the ternary operator at val ? ...).
Written out more readably, with comments:
(
toBinary = (val, str = "") => // assign toBinary to an arrow function with 2 parameters val and str (default "").
val ? // if val is non-zero...
toBinary(val >> 1, (val & 1) + str) : // ...recurse on the next bit
str // ...otherwise stop the recursion and return the result
)(265); // call toBinary directly
*/

18520339 commented Sep 2, 2023

Polynomial features

The value of $\mathbf{w}$ is [0.08 0.54 0.03] and $b$ is 0.0106. This implies the model after fitting/training is: $0.08x + 0.54x^2 + 0.03x^3 + 0.0106$

Gradient descent has emphasized the feature that best fits the data, $x^2$, by increasing the $w_1$ term relative to the others. If you were to run it for a very long time, it would continue to reduce the impact of the other terms.

Gradient descent is picking the 'correct' features for us by emphasizing their associated parameters

Let's review this idea:

  • Initially, the features were re-scaled so they are comparable to each other
  • A smaller weight value implies a less important/correct feature; in the extreme, when the weight becomes zero or very close to zero, the associated feature is not useful in fitting the model to the data.
  • Above, after fitting, the weight associated with the $x^2$ feature is much larger than the weights for $x$ or $x^3$, as it is the most useful in fitting the data.

Above, polynomial features were chosen based on how well they matched the target data. Another way to think about this is to note that we are still using linear regression once we have created new features. Given that, the best features will be linear relative to the target.
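
A minimal sketch of this idea (toy data and hyperparameters are assumed, not the exact lab setup): fit linear regression by gradient descent on the engineered features $x, x^2, x^3$ with target $y = x^2$, and check that most of the weight lands on the $x^2$ column.

```python
# Sketch: gradient descent on polynomial features where the target really is quadratic.
import numpy as np

x = np.arange(0, 20, 1)
y = x ** 2                       # the target is exactly x^2
X = np.c_[x, x**2, x**3]         # engineered polynomial features

w, b = np.zeros(X.shape[1]), 0.0
alpha, iters = 1e-7, 10000       # tiny alpha because the features are not rescaled here
for _ in range(iters):
    err = X @ w + b - y          # batch gradient descent on the squared-error cost
    w -= alpha * (X.T @ err) / len(y)
    b -= alpha * err.mean()

print(w, b)                      # the weight on the x^2 column dominates the others
```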


18520339 commented Sep 3, 2023

Calculate $\frac{\partial{L}}{\partial{w}}$ without the chain rule, where $L$ is the loss function of a simple Logistic Regression model with binary cross-entropy:

$$\eqalign{ L &= -[y * ln(\frac{1}{1 + e^{-(wx+b)}}) + (1 - y) * ln(1 - \frac{1}{1 + e^{-(wx+b)}})] \\ &= -[y * ln(\frac{1}{1 + e^{-(wx+b)}}) + (1 - y) * ln(\frac{e^{-(wx+b)}}{1 + e^{-(wx+b)}})] \\ }$$

Now, use the facts that $ln(\frac{1}{1 + e^{-z}}) = -ln(1 + e^{-z})$ and $ln(\frac{e^{-z}}{1 + e^{-z}}) = -z - ln(1 + e^{-z})$, where $z = wx+b$:

$$\eqalign{ L &= -[-y * ln(1 + e^{-(wx+b)}) + (1 - y) * (-(wx+b) - ln(1 + e^{-(wx+b)}))] \\ &= -[-y * ln(1 + e^{-(wx+b)}) - (1 - y) * ln(1 + e^{-(wx+b)}) - (1 - y) * (wx+b)] \\ &= ln(1 + e^{-(wx+b)}) + (1 - y) * wx + (1 - y) * b \\ }$$

Now, find $\frac{\partial{L}}{\partial{w}}$:

$$\eqalign{ \frac{\partial{L}}{\partial{w}} &= \frac{-x * e^{-(wx+b)}}{1 + e^{-(wx+b)}} + (1 - y) * x \\ &= \frac{-x * e^{-(wx+b)} + (1 - y) * x * (1 + e^{-(wx+b)})}{1 + e^{-(wx+b)}} \\ &= \frac{-x * e^{-(wx+b)} + (1 - y) * x + (1 - y) * x * e^{-(wx+b)}}{1 + e^{-(wx+b)}} \\ &= \frac{-xy * e^{-(wx+b)} + x - xy}{1 + e^{-(wx+b)}} \\ &= \frac{-xy * (1 + e^{-(wx+b)}) + x}{1 + e^{-(wx+b)}} \\ &= -xy + \frac{x}{1 + e^{-(wx+b)}} \\ &= (\frac{1}{1 + e^{-(wx+b)}} - y) * x \\ &= (sigmoid(z) - y) * x }$$
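
A quick numerical check of the result (example values are made up): compare a finite-difference estimate of $\frac{\partial{L}}{\partial{w}}$ against $(sigmoid(z) - y) * x$.

```python
# Numerical gradient check for the single-example binary cross-entropy loss.
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))
def loss(w, b, x, y):
    p = sigmoid(w * x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

w, b, x, y, eps = 0.7, -0.3, 2.0, 1.0, 1e-6
numeric  = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
analytic = (sigmoid(w * x + b) - y) * x
print(numeric, analytic)   # the two values agree to ~6 decimal places
```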



18520339 commented Sep 7, 2023

Backprop


Compute $\frac{\partial{J}}{\partial{a}}$ once & use it to compute both $\frac{\partial{J}}{\partial{w}}$ and $\frac{\partial{J}}{\partial{b}}$

With N nodes and P parameters (w1, b1, w2, b2, ...), backprop computes the derivatives in roughly N + P steps rather than N x P steps => all derivatives are computed quickly and can then be fed into Gradient Descent or Adam.


A computation graph simplifies the computation of complex derivatives by breaking them into smaller steps
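
A minimal sketch (toy numbers assumed) of the idea on the graph $J = \frac{1}{2}(a - y)^2$ with $a = wx + b$: compute $\frac{\partial{J}}{\partial{a}}$ once and reuse it for both $\frac{\partial{J}}{\partial{w}}$ and $\frac{\partial{J}}{\partial{b}}$.

```python
# Backprop on a tiny computation graph: J = 0.5 * (a - y)^2 with a = w*x + b.
w, b, x, y = 2.0, 8.0, -2.0, 1.0

# forward pass
a = w * x + b          # a = 4
J = 0.5 * (a - y)**2   # J = 4.5

# backward pass
dJ_da = a - y          # computed once
dJ_dw = dJ_da * x      # reuses dJ/da
dJ_db = dJ_da * 1      # reuses dJ/da again
print(dJ_dw, dJ_db)    # -6.0 3.0
```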


18520339 commented Sep 9, 2023

Standard Error (SE)

In statistics, the std of a sample statistic is called the SE. The SE of the mean measures variability among all your sample means. A smaller SE = less variability, implying that the sample mean is a more accurate estimate of the population mean.

SE is based on the idea of repeated sampling, but in reality, researchers typically work with only one sample due to the complexity, cost, or time constraints of taking multiple samples. Statisticians have created a formula for calculating the SE based on the assumption of repeated sampling: Sample std / sqrt(Sample size).

SE measures the expected difference between your sample mean and the actual population mean. As your sample gets larger, the SE gets smaller, so your sample mean gets closer to the actual population mean. Say you collect a sample of 10,000 penguins. You find that the sample mean weight is 3 pounds and the sample std is 1 pound. The SE is 1/sqrt(10,000) = 0.01. Your best estimate for the sample mean will still be 3 pounds, but now you can expect that the mean weight from one sample of penguins to the next will vary with a std of just 0.01 pounds.
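
The penguin numbers from the note, as a tiny sketch:

```python
# SE = sample std / sqrt(sample size), using the values quoted above.
import math

sample_std = 1.0        # pounds
sample_size = 10_000
se = sample_std / math.sqrt(sample_size)
print(se)               # 0.01 -> sample means vary with a std of ~0.01 pounds
```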


18520339 commented Sep 9, 2023

Error analysis

If you have a larger cross-validation set, say 5,000 examples, and the algorithm misclassifies 1,000 of them, you may not have the time (depending on team size and how much time you have for this project) to manually look at all 1,000 misclassified examples. In that case, I will often randomly sample a subset, usually around 100, maybe a couple hundred, of the examples the model misclassified in order to identify common traits and trends.

Error analysis can be a bit harder for tasks that even humans aren't good at. For example, if you're trying to predict which ads someone will click on a website, a human can't predict that either, so error analysis there tends to be more difficult. But when you can apply error analysis, it can be extremely helpful for focusing attention on the more promising things to try. By identifying similar types of errors, you can collect more data similar to these misclassified examples in order to train the model to improve on them.


18520339 commented Sep 10, 2023

Feature scaling

As with the training set, you will also want to scale the cross-validation set. An important thing to note when using the z-score is that you have to use the mean and standard deviation of the training set when scaling the cross-validation set. This ensures that your input features are transformed as the model expects. One way to gain intuition is with this scenario (a small sketch follows the list):

  • Say that your training set has an input feature equal to 500 which is scaled down to 0.5 using the z-score.
  • After training, your model is able to accurately map this scaled input x=0.5 to the target output y=300.
  • Now let's say that you deployed this model and one of your users fed it a sample equal to 500.
  • If you get this input sample's z-score using any other values of the mean and standard deviation, then it might not be scaled to 0.5 and your model will most likely make a wrong prediction (i.e. not equal to y=300).
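
A minimal sketch of the rule (feature values are made up): compute the mean and standard deviation on the training set once, then reuse them for any later sample.

```python
# Reuse the *training* mean/std when scaling the cross-validation set or a deployed sample.
import numpy as np

X_train = np.array([[120.0], [300.0], [500.0], [640.0]])
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)

X_train_scaled = (X_train - mu) / sigma

x_new = np.array([[500.0]])             # user-supplied sample at serving time
x_new_scaled = (x_new - mu) / sigma     # same mu/sigma as training, NOT recomputed
print(x_new_scaled)
```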


18520339 commented Sep 10, 2023

ReLU

Neural networks can learn non-linear relationships so we can opt to skip adding polynomial features like in Linear Regression.



18520339 commented Sep 10, 2023

Transfer learning


The earlier layers of the model may be reusable as is, because they identify low-level features that are relevant to your task. Alternatively, it may help to train all the layers of the model on your own training set, though this takes more time than training only the parameters of the output layer.
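
A hedged sketch of the two options using Keras (the layer sizes and the "pretrained" base here are hypothetical): freeze the earlier layers and train only a new output layer, or unfreeze everything and fine-tune all layers.

```python
# Transfer-learning sketch: reuse earlier layers as-is (frozen) or fine-tune them all.
import tensorflow as tf

base = tf.keras.Sequential([                 # stand-in for a model pretrained elsewhere
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
])
base.trainable = False                       # Option 1: only the new output layer trains
# base.trainable = True                      # Option 2: fine-tune all layers (slower)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    base,
    tf.keras.layers.Dense(10, activation="softmax"),  # new output layer for your task
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```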


18520339 commented Sep 11, 2023

Confidence interval

The 4 steps for constructing a confidence interval:

  1. Identify a sample statistic
  2. Choose a confidence level
  3. Find the margin of error (SE × z-score, or t-score for small samples)
  4. Calculate the interval

As the confidence level gets higher, the confidence interval gets wider, and as the sample size gets larger, the confidence interval gets narrower. Confidence intervals for small sample sizes (< 30), which use the t-distribution, only deal with population means, not population proportions.

If you take repeated random samples from the population, and construct a confidence interval for each sample using the same method, you can expect 95% of your intervals to capture the population mean.

  • Imagine you take 20 random samples of 100 penguins each from the penguin population, and calculate a 95% confidence interval for each sample. You can expect that approximately 19 of the 20 intervals, or 95% of the total, will contain the actual population mean weight of 31 pounds. One such interval will be the range of values between 28 pounds and 32 pounds.
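
A minimal sketch of the 4 steps (the sample values are made up); for a small sample you would swap the z critical value for a t critical value:

```python
# Confidence interval: statistic, confidence level, margin of error = SE * critical value, interval.
import numpy as np
from scipy import stats

sample = np.array([29.1, 32.4, 30.8, 31.9, 28.7, 33.0, 30.2, 31.5])
mean = sample.mean()                               # 1. sample statistic
confidence = 0.95                                  # 2. confidence level
se = sample.std(ddof=1) / np.sqrt(len(sample))
z = stats.norm.ppf(1 - (1 - confidence) / 2)       # ~1.96 (use stats.t.ppf for small n)
margin = z * se                                    # 3. margin of error
print(mean - margin, mean + margin)                # 4. the interval
```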


18520339 commented Sep 11, 2023

Hypothesis testing

|          | Null hypothesis (H0)                  | Alternative hypothesis (Ha)           |
|----------|---------------------------------------|---------------------------------------|
| Claims   | There is no effect in the population. | There is an effect in the population. |
| Language | No effect, no difference, no relationship, no change | An effect, a difference, a relationship, a change |
| Symbols  | Equality (=, ≤, ≥)                    | Inequality (≠, <, >)                  |

The steps for conducting a hypothesis test:

  1. State the null hypothesis and the alternative hypothesis: The null and alternative hypotheses are always claims about the population. That’s because the aim of hypothesis testing is to make inferences about a population based on a sample.
  2. Choose a significance level: Typically, 0.05. That means results at least as extreme as yours only have a 5% chance (or less) of occurring when the null hypothesis is true. A lower significance level means an effect has to be larger to be considered statistically significant.
  3. Find the p-value:
  • The probability that the observed data (or data more extreme) could have occurred under the null hypothesis.
  • The probability of observing results as or more extreme than those observed when the null hypothesis is true. In the context of hypothesis testing, “extreme” means extreme in the direction(s) of the alternative hypothesis.
  4. Reject or fail to reject the null hypothesis.

Use a z-test when the population std is known, and use a t-test when the population std is unknown and needs to be estimated from the data. The t-distribution has bigger tails than the standard normal distribution does; the bigger tails reflect the higher frequency of outliers that comes with small datasets.
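
A minimal sketch (made-up samples) of a 2-sample t-test with scipy; reject H0 when the p-value is below the chosen significance level:

```python
# 2-sample t-test: population std unknown, so it is estimated from the data.
from scipy import stats

control   = [13.2, 14.1, 12.8, 13.9, 14.5, 13.1, 12.7, 14.0]
treatment = [12.1, 12.9, 11.8, 12.4, 13.0, 12.2, 11.9, 12.6]

t_stat, p_value = stats.ttest_ind(treatment, control)
alpha = 0.05
print(p_value, "reject H0" if p_value < alpha else "fail to reject H0")
```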


18520339 commented Sep 12, 2023

Type I and Type II errors

A Type I error means rejecting a null hypothesis which is actually true. In general, making a Type I error often leads to implementing changes that are unnecessary and ineffective, and which waste valuable time and resources.

A Type II error means failing to reject a null hypothesis which is actually false. In general, making a Type II error may result in missed opportunities for positive change and innovation. A lack of innovation can be costly for people and organizations.


  • Reduce Type I error by choosing a lower significance level (α). Reducing your risk of making a Type I error means you are more likely to make a Type II error (β).
  • β is related to the power of a hypothesis test (power = 1-β). Power refers to the likelihood that a test can correctly detect a real effect when there is one.
  • You can reduce your risk of making a Type II error by ensuring your test has enough power. In data work, power is usually set at 0.80 or 80%. The higher the statistical power, the lower the probability of making a Type II error. To increase power, you can increase your sample size or your significance level.


18520339 commented Sep 12, 2023

One-tailed test

Say your test statistic is a z-score of 1.75 and your p-value is 0.04. In a left-tailed test, the p-value is the probability of getting a z-score at least 1.75 standard units below the mean (z-score < -1.75). The probability of getting a value less than a z-score of -1.75 is calculated by taking the area under the distribution curve to the left of that z-score.

Note that the p-value for a 2-tailed test is always 2x the p-value for a 1-tailed test. So, in that case, your p-value = 0.04 + 0.04 = 0.08. A one-tailed test may be appropriate if the negative consequences of missing an effect in the untested direction are minimal.
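
A small sketch of that arithmetic, assuming z = 1.75 and using scipy's normal CDF:

```python
# One-tailed vs two-tailed p-values for z = 1.75.
from scipy import stats

z = 1.75
p_right = 1 - stats.norm.cdf(z)    # right-tailed: area to the right of 1.75 (~0.04)
p_left  = stats.norm.cdf(-z)       # left-tailed: area to the left of -1.75 (~0.04)
p_two   = 2 * p_right              # two-tailed: ~0.08
print(round(p_right, 2), round(p_left, 2), round(p_two, 2))
```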


18520339 commented Sep 12, 2023

Experimental design

  1. Define the independent and dependent variables in your experiment:
    • In your clinical trial, you want to find out how the medicine affects recovery time. Therefore:
      • Your independent variable is the medicine—the cause you want to investigate.
      • Your dependent variable is recovery time—the effect you want to measure.
    • In a more complex experiment, you might test the effect of different medicines on recovery time, or different doses of the same medicine. In each case, you manipulate your independent variable (medicine) to measure its effect on your dependent variable (recovery time).
  2. Formulate your hypothesis: H0 is that the medicine has no effect. Ha is that the medicine is effective.
  3. Assign test subjects to treatment and control groups:
    • Experiments such as clinical trials and A/B tests are controlled experiments. A typical A/B test has at least 3 main features: Test design, Sampling, Hypothesis testing.
    • In a controlled experiment, test subjects are assigned to a treatment group and a control group. The treatment is the new change being tested in the experiment. The treatment group is exposed to the treatment. The control group is not exposed to the treatment. The difference in metric values between the 2 groups measures the treatment’s effect on the test subjects.
    • As a next step, you might conduct a 2-sample t-test to determine whether the observed difference in recovery time is statistically significant or due to chance.


18520339 commented Sep 12, 2023

Randomized controlled experiment

An A/B test is a basic version of what’s known as a randomized controlled experiment. This design allows researchers to control for other factors that might influence the test results and draw causal conclusions about the effect of the treatment.

  • For example, imagine the subjects in your treatment group have a much healthier diet than the subjects in your control group. Any observed decrease in recovery time for the treatment group might be due to their healthier diet—and not to the medicine. In this case, you cannot say with confidence that the medicine alone is the cause of the faster recovery time.

Typically, data professionals randomly assign test subjects to treatment and control groups. Randomization helps control the effect of other factors on the outcome of an experiment:

  • Completely randomized design: test subjects are assigned to treatment and control groups using a random process.
    • For example, in a clinical trial, you might use a computer program to label each subject with a number => randomly select numbers for each group => maybe not effective due to nuisance factors.
  • Randomized block design: minimize the impact of known nuisance factors by dividing subjects into blocks => randomly assign the subjects within each block to treatment and control groups (a small sketch follows this list).
    • For example, you know that people under the age of 35 tend to recover faster than older people => age is a nuisance factor => divide the test subjects into 21-35, 36-50, and 51-65 => randomly assign the subjects within each block to treatment and control groups => more confident that this result is due to the treatment (medicine) and not to the nuisance factor (age).
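
A minimal sketch of a randomized block design (the subjects and age groups are made up): block on the nuisance factor, then randomize within each block.

```python
# Randomized block design: block on age group, randomize to treatment/control within each block.
import random

subjects = [("s%02d" % i, random.choice(["21-35", "36-50", "51-65"])) for i in range(12)]
blocks = {}
for name, age_group in subjects:
    blocks.setdefault(age_group, []).append(name)

assignment = {}
for age_group, members in blocks.items():
    random.shuffle(members)
    half = len(members) // 2
    for name in members[:half]:
        assignment[name] = "treatment"
    for name in members[half:]:
        assignment[name] = "control"
print(assignment)
```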


18520339 commented Sep 13, 2023

GAN


Discriminator wants the fake examples to seem as fake as possible, but the Generator wants fake examples to seem as real as possible. That is, it wants to fool the Discriminator. The Generator doesn't see the real images. It learns over time by using feedback from the Discriminator.

They need to learn from each other over time, and both should be at similar skill levels. If one model is significantly better than the other, it doesn't help the other learn because the feedback isn't useful. Imagine you were a beginning artist and you showed your work to an art expert, asking whether your painting looked like a famous piece, and all they said was 'no'. Because they have a very discerning eye, they know your image is not right, but they won't be able to tell you how close you are.


18520339 commented Sep 13, 2023

Activations

ReLU suffers from the dying ReLU problem, while sigmoid and tanh suffer from vanishing gradient and saturation problems.

Sigmoid activation isn't used very often in hidden layers because the derivative of the function approaches 0 at the tails of this function => Vanishing gradient problems, or saturated outputs here at the tails of the function.

You can imagine that this function continues to go in both directions because it can take any real value as input. It is asymptotically approaching 1 at the top and asymptotically approaching 0 on the bottom => Vanishing gradient problems, because you have these saturated outputs here at the tails, and the values will always be ~1 or ~0.

Although Tanh has a similar shape to Sigmoid, its range of -1 to 1 preserves the sign of the input z, so negatives are still negative. That can be useful in some applications. Because its shape is similar to the sigmoid, however, the same saturation and vanishing gradient issues occur. Again, the tails extend on both sides, approaching 1 at the top and -1 at the bottom.
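
A small sketch showing the saturation (input values are made up): the sigmoid and tanh derivatives shrink toward 0 at the tails, while ReLU's derivative is 0 for negative inputs (the dying-ReLU case).

```python
# Derivatives of sigmoid, tanh, and ReLU at a few points: near-zero gradients at the tails.
import numpy as np

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

sig = 1 / (1 + np.exp(-z))
print(sig * (1 - sig))        # sigmoid'(z): ~0 at |z| = 10
print(1 - np.tanh(z) ** 2)    # tanh'(z):    ~0 at |z| = 10
print((z > 0).astype(float))  # ReLU'(z):    0 for z <= 0, 1 otherwise
```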


18520339 commented Sep 13, 2023

Batch normalization

  • Batch normalization smooths the cost function and reduces the internal covariate shift.
  • You use the batch mean and standard deviation during training and the running statistics (accumulated over the training set during training) for testing. The running values are fixed after training.
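
A minimal, framework-free sketch of that train/test difference (the parameter values are assumed):

```python
# Batch norm: batch statistics during training, fixed running statistics at test time.
import numpy as np

gamma, beta, eps, momentum = 1.0, 0.0, 1e-5, 0.9
running_mean, running_var = 0.0, 1.0

def batchnorm(x, training):
    global running_mean, running_var
    if training:
        mu, var = x.mean(), x.var()
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var  = momentum * running_var  + (1 - momentum) * var
    else:                       # running values are fixed after training
        mu, var = running_mean, running_var
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.random.randn(32) * 3 + 5
print(batchnorm(x, training=True).std())   # ~1 during training
```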


18520339 commented Sep 14, 2023

Problem with BCE Loss

Mode collapse

Mode collapse happens when the generator gets stuck in one mode. The discriminator will eventually learn to differentiate the generator's fakes when this happens and outskill it, ending the model's learning.

For example, a discriminator has learned to be good at identifying which handwritten digits are fakes, except for cases where the generated images look like 1 and 7. This could mean the discriminator is at a local minimum of its cost function: it classifies most of the digits correctly, except for the ones that resemble 1 and 7. This information is then passed on to the generator, which looks at the feedback from the discriminator and gets a good idea of how to fool the discriminator in the next round (by producing mostly 1s and 7s).

Vanishing gradient


GANs try to make 2 distributions look similar. When the discriminator improves too much, the function approximated by BCE Loss will contain flat regions, i.e. vanishing gradients.

At the beginning, there's some overlap between 2 distributions. However, as it gets better at training, the real distribution will be centered around 1 and the generated distribution will start to approach 0. As a result, when this discriminator is getting better, it'll start giving less informative feedback.


18520339 commented Sep 25, 2023

Decision tree


The entropy measures the impurity of a set of data. It starts from 0, goes up to 1, and then comes back down to 0 as a function of the fraction of positive examples in your sample:

  • We take log base 2 just to make the peak of this curve equal to 1; if we were to take log base e (the natural logarithm), that would just vertically scale this function -> it would still work, but be a bit harder to interpret.
  • If p1 or p0 = 0 -> 0*log(0), and log(0) is technically undefined (negative infinity). But by convention, we'll take 0*log(0) = 0.
  • Looks a little bit like logistic loss -> there is actually a mathematical rationale for why these 2 formulas look so similar.
  • There are other functions that look like this, they go from 0 up to 1 and then back down such as the Gini criteria.

The information gain measures the reduction in entropy that you get in your tree resulting from making a split. Why do we bother to compute the reduction in entropy rather than just entropy at the left and right sub-branches?:

  • It turns out that one of the stopping criteria for deciding when not to bother to split any further is if the reduction in entropy is too small.
  • In which case, you could decide you're just increasing the size of the tree unnecessarily and risking overfitting by splitting and just decide not to bother if the reduction in entropy is too small or below a threshold.
  • For a continuous-valued feature (such as the weight of the animal), there are 10 animals in the dataset. The recommended way to find the best split for that feature is to choose the 9 mid-points between the 10 examples as possible splits and find the split that gives the highest information gain.
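
A minimal sketch (toy labels assumed) of entropy and the information gain of one split, as described above:

```python
# Entropy of a node and the information gain (reduction in entropy) of a split.
import numpy as np

def entropy(p1):
    if p1 in (0, 1):
        return 0.0                       # convention: 0 * log2(0) = 0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(y_parent, y_left, y_right):
    h = lambda y: entropy(np.mean(y)) if len(y) else 0.0
    w_left = len(y_left) / len(y_parent)
    w_right = len(y_right) / len(y_parent)
    return h(y_parent) - (w_left * h(y_left) + w_right * h(y_right))

y_node = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # fraction of positives p1 = 0.5 -> entropy = 1
print(information_gain(y_node, [1, 1, 1, 1, 0], [1, 0, 0, 0, 0]))  # ~0.28
```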


18520339 commented Sep 25, 2023

Tree ensembles

A single decision tree can be highly sensitive to small changes in the data -> changing just 1 training example can cause the algorithm to come up with a different split at the root -> a totally different tree, which makes this algorithm not that robust -> use a whole bunch of different trees and vote.

Sample the training data with replacement and select a random subset of features to build each tree so that they are not all identical to each other. If $n$ is the number of features, we will randomly select $\sqrt{n}$ of these features to train each individual tree (see the sketch below).
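
A minimal sketch (toy shapes assumed) of that sampling scheme: bootstrap the rows with replacement and pick $\sqrt{n}$ random features per tree.

```python
# Per-tree inputs for a random forest: bootstrap rows + a random sqrt(n) subset of features.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))            # 100 examples, n = 16 features
y = rng.integers(0, 2, size=100)

def bootstrap_tree_inputs(X, y):
    rows = rng.integers(0, len(X), size=len(X))           # sample with replacement
    k = int(np.sqrt(X.shape[1]))                          # sqrt(n) random features
    cols = rng.choice(X.shape[1], size=k, replace=False)
    return X[rows][:, cols], y[rows], cols

X_tree, y_tree, used_features = bootstrap_tree_inputs(X, y)
print(X_tree.shape, used_features)        # (100, 4) and the 4 chosen feature indices
```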


18520339 commented Sep 26, 2023

Kmeans

You shouldn't use the elbow method to choose the right number of clusters because, for a lot of applications, the right number of clusters is truly ambiguous, and you find that a lot of cost functions just decrease smoothly and don't have a clear elbow.

Don't choose $K$ just to minimize the cost function $J$: having more clusters will pretty much always reduce $J$, so you would almost always end up choosing the largest possible value of $K$ -> Instead, evaluate K-means based on how well it performs (how much sense it makes) for the later, downstream purpose.

The $K$-means algorithm will always converge to some final set of means for the centroids. However, the converged solution may not always be ideal and depends on the initial setting of the centroids.

  • Therefore, in practice, the K-means algorithm is usually run a few times with different random initializations.
  • One way to choose between these different solutions from different random initializations is to choose the one with the lowest cost function value (distortion).
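
A small sketch of that procedure with scikit-learn (toy data assumed); sklearn's n_init parameter automates exactly this loop:

```python
# Run K-means several times with different random initializations, keep the lowest distortion.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

best = None
for seed in range(5):                                   # different random initializations
    km = KMeans(n_clusters=2, n_init=1, random_state=seed).fit(X)
    if best is None or km.inertia_ < best.inertia_:     # inertia_ = distortion J
        best = km
print(best.inertia_)
```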


18520339 commented Sep 26, 2023

Anomaly detection

Try to make sure the features you give it are more or less Gaussian. If your features are not Gaussian, sometimes you can transform them to make them a little bit more Gaussian -> Train the model and then see which anomalies in the cross-validation set the algorithm fails to detect -> Look at those examples to see if they inspire the creation of new features that would allow the algorithm to spot them.
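
A small sketch of the feature-transform idea (the transforms tried here are just common choices, not a prescribed list): compare how symmetric the feature looks after each candidate transform.

```python
# Make a right-skewed feature more Gaussian before fitting the density model.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)     # heavily right-skewed feature

candidates = {"log1p": np.log1p(x), "sqrt": np.sqrt(x), "cuberoot": x ** (1 / 3)}
for name, xt in candidates.items():
    # rough symmetry check: skewness near 0 suggests "more Gaussian"
    skew = np.mean(((xt - xt.mean()) / xt.std()) ** 3)
    print(name, round(skew, 2))
```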


18520339 commented Sep 26, 2023

Collaborative Filtering


If you were to run the algorithm on this dataset, you would actually end up with the parameters w = [0 0] & b = 0 for the user Eve. Because Eve hasn't rated any movies yet, w & b don't affect the first term in the cost function, since none of Eve's movie ratings play a role in the squared-error cost -> Normalize the rows (mean normalization) so that the algorithm can give reasonable ratings for such a user.

Normalizing the columns would help if there was a brand-new movie that no one has rated yet. But if there's a brand-new movie that no one has rated yet, you probably shouldn't show that movie to too many users initially because you don't know much about it. So normalizing columns is less important than normalizing the rows to help with the case of a new user that has hardly rated any movies yet.
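
A minimal sketch of mean-normalizing the rows (toy rating matrix assumed): a brand-new user with w = [0 0] and b = 0 then gets predicted each movie's average rating instead of 0.

```python
# Mean normalization of the rows (movies) of a ratings matrix.
import numpy as np

Y = np.array([[5.0, 4.0, 0.0, 0.0],        # rows = movies, columns = users
              [4.0, 0.0, 0.0, 4.0],
              [1.0, 1.0, 5.0, 0.0]])
R = (Y > 0).astype(float)                   # R[i, j] = 1 if user j rated movie i

mu = (Y * R).sum(axis=1) / R.sum(axis=1)    # per-movie mean over rated entries only
Y_norm = (Y - mu[:, None]) * R              # subtract the mean where a rating exists

# For a new user (w = 0, b = 0), the prediction w.x + b + mu falls back to mu.
print(mu)
```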


18520339 commented Sep 27, 2023

Content-based filtering


The retrieval step tries to prune out a lot of items that are just not worth doing the more detailed inference and inner product on. The ranking step then makes a more careful prediction for the items that the user is actually likely to enjoy. Retrieving more items results in better performance but slower recommendations.

To optimize the trade-off, carry out offline experiments to see if retrieving additional items results in more relevant recommendations (e.g., whether $p(y^{(i,j)} = 1)$ for items displayed to the user is higher).



18520339 commented Sep 28, 2023

Reinforcement learning

One way to think of why reinforcement learning is so powerful is you have to tell it what to do rather than how to do it:

  • For an autonomous helicopter, you could then train a neural network using supervised learning to directly learn the mapping from the states $s$ (x) to action $a$ (y).
  • But it turns out that when the helicopter is moving through the air, the exact right action to take is actually very ambiguous, so it's very difficult to get a dataset of states x and ideal actions y -> For lots of tasks of controlling a robot like a helicopter & other robots, the supervised learning approach doesn't work well, and we instead use reinforcement learning.
  • Specifying the reward function (make it impatient) rather than the optimal action gives you more flexibility in how to design the system.

Reinforcement learning is more finicky in terms of the choice of hyperparameters. For example, in supervised learning, if you set the learning rate a little bit too small, then the algorithm may take 3 times longer to train, which is annoying but maybe not that bad. Whereas in reinforcement learning, if you set the value of epsilon or other parameters poorly, it may take 10 times or 100 times longer to learn.


18520339 commented Sep 28, 2023

Bellman Equation


When the RL problem is stochastic, there isn't a sequence of rewards that you see for sure -> what we're interested in is not maximizing the return (because that's a random number) but maximizing the average value of the sum of discounted rewards. In cases where both the state and action space are discrete, we can estimate the action-value function iteratively by using the Bellman equation:

$$ Q_{i+1}(s,a) = R + \gamma \max_{a'}Q_i(s',a') $$

This iterative method converges to the optimal action-value function $Q^*(s,a)$ as $i\to\infty$. This means that the agent just needs to gradually explore the state-action space and keep updating the estimate of $Q(s,a)$ until it converges to the optimal action-value function $Q^*(s,a)$:

  • However, in cases where the state space is continuous it becomes practically impossible to explore the entire state-action space. Consequently, this also makes it practically impossible to gradually estimate $Q(s,a)$ until it converges to $Q^*(s,a)$.

In Deep $Q$-Learning, we solve this problem by using a neural network to estimate the action-value function $Q(s,a)\approx Q^*(s,a)$. The network can be trained by adjusting its weights at each iteration to minimize the mean-squared error in the Bellman equation:

  • Using neural networks in RL to estimate action-value functions has proven to be highly unstable -> Use a Target Network (soft update) and Experience Replay: store the agent's states, actions, and rewards in a memory buffer, then sample a random mini-batch to generate uncorrelated experiences for training the agent.
  • Towards the end of training, the agent will lean towards selecting the action that it believes (based on past experiences) will maximize $Q(s,a)$ -> We will set the minimum 𝜖 value = 0.01 (not 0) because we always want to keep a little bit of exploration during training.
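
A minimal sketch of the iterative Bellman update for a tiny, made-up MDP with discrete states and actions:

```python
# Iterative Bellman update Q(s, a) <- R + gamma * max_a' Q(s', a') on a toy 2-state MDP.
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
R = np.array([[0.0, 1.0],       # R[s, a]: reward for taking action a in state s
              [1.0, 0.0]])
next_state = np.array([[0, 1],  # deterministic transitions: next_state[s, a]
                       [1, 0]])

Q = np.zeros((n_states, n_actions))
for _ in range(100):                                   # repeated sweeps over the state-action space
    for s in range(n_states):
        for a in range(n_actions):
            s_next = next_state[s, a]
            Q[s, a] = R[s, a] + gamma * Q[s_next].max()
print(Q)   # converges toward Q* as the number of sweeps grows
```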


18520339 commented Sep 28, 2023

Deep reinforcement learning

In the standard “agent-environment loop” formalism, an agent interacts with the environment in discrete time-steps t=0,1,2,... At each $t$, the agent uses a policy $\pi$ to select an action $A_t$ based on its observation of the environment's state $S_t$. The agent receives a numerical reward $R_t$ and on the next time step, moves to a new state $S_{t+1}$.
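
A hedged sketch of that loop; `env` and `policy` here are hypothetical stand-ins, not a specific library's API:

```python
# The agent-environment loop: observe S_t, pick A_t with the policy, receive R_t and S_{t+1}.
def run_episode(env, policy, max_steps=1000):
    s = env.reset()                       # S_0
    total_reward = 0.0
    for t in range(max_steps):
        a = policy(s)                     # A_t = pi(S_t)
        s, r, done = env.step(a)          # R_t and the next state S_{t+1}
        total_reward += r
        if done:
            break
    return total_reward
```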
