Machine learning notes
Focus on data analysis. The data matters more than the model, because we already have the tools for the mathematical operations; still, it is good to be familiar with the available models, since some will work better than others on a given problem.
Expect a lot of trial and error.
Categorical attributes are repeated values drawn from a limited set, which is why numeric fields can also be categorical.
https://www.dummies.com/education/math/statistics/why-standard-deviation-is-an-important-statistic/
The std row shows the standard deviation, which measures how dispersed the values are.
Std is important because it captures roughly the behavior we need to measure: how far (high std) or how close (low std) the values are, on average, from the mean.
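As a minimal sketch (toy numbers, invented column names), pandas' `describe()` is where that std row comes from:

```python
import pandas as pd

# Hypothetical toy data: two columns with the same mean but different spread.
df = pd.DataFrame({
    "tight": [9, 10, 10, 11],    # values close to the mean -> low std
    "spread": [0, 5, 15, 20],    # values far from the mean -> high std
})

# describe() reports count, mean, std, min, quartiles and max per column;
# the std row is the sample standard deviation around each column's mean.
print(df.describe())
print(df.std())  # same numbers as the std row
```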
The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that
there is a strong positive correlation; for example, the median house value tends to go
up when the median income goes up. When the coefficient is close to –1, it means
that there is a strong negative correlation; you can see a small negative correlation
between the latitude and the median house value (i.e., prices have a slight tendency to
go down when you go north). Finally, coefficients close to zero mean that there is no
linear correlation.
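A quick sketch of how such a correlation matrix is usually computed with pandas; the DataFrame and its column names here are invented for illustration:

```python
import pandas as pd

# Hypothetical housing-style data (column names assumed).
housing = pd.DataFrame({
    "median_income": [2.5, 3.8, 5.1, 6.9, 8.2],
    "latitude": [37.9, 37.5, 36.8, 34.2, 33.9],
    "median_house_value": [150000, 210000, 290000, 380000, 450000],
})

# Pearson correlation of every numeric column against the target;
# values near +1 / -1 indicate a strong linear relationship, near 0 none.
corr_matrix = housing.corr()
print(corr_matrix["median_house_value"].sort_values(ascending=False))
```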
exploratory data analysis (EDA)
mean feature encoding
**Scaling**
You can't scale arbitrarily; whether and how to scale depends on the model.
In the general case, should we apply these transformations to all numeric features when we train a non-tree-based model?
Yes, we should apply the chosen transformation to all numeric features.
We use preprocessing to scale all features to one scale, so that their initial impact on the model is roughly similar. For example, with KNN, leaving features on different scales could let some of them have a critical influence on the predictions.
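A minimal sketch of that kind of preprocessing, assuming scikit-learn's MinMaxScaler and StandardScaler and a made-up two-feature matrix:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix: one feature in the thousands, one in [0, 1].
X = np.array([[1000.0, 0.1],
              [2000.0, 0.5],
              [3000.0, 0.9]])

# MinMaxScaler maps each feature to [0, 1]; StandardScaler to zero mean, unit variance.
# Either way both features end up on a comparable scale, so a distance-based
# model such as KNN is not dominated by the large-magnitude feature.
print(MinMaxScaler().fit_transform(X))
print(StandardScaler().fit_transform(X))
```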
**Outliers**
You can delete them, or perform clipping, which establishes upper and lower bounds; see also winsorization.
The rank transformation works well because it brings outliers closer to the rest of the data.
The log transformation also works well, especially for neural networks.
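A short sketch of clipping, winsorization and the rank transform on a toy series (the percentile limits are chosen arbitrarily for illustration):

```python
import pandas as pd
from scipy.stats import rankdata
from scipy.stats.mstats import winsorize

# Hypothetical numeric feature with one extreme value.
x = pd.Series([1.0, 2.0, 2.5, 3.0, 100.0])

# Clipping: cap values at chosen lower/upper percentiles (here 5th and 95th).
lower, upper = x.quantile([0.05, 0.95])
clipped = x.clip(lower, upper)

# Winsorization: replace the top 20% with the largest remaining value.
winsorized = winsorize(x.values, limits=[0.0, 0.2])

# Rank transform: the outlier becomes just the largest rank, close to the rest.
ranked = rankdata(x)

print(clipped.values)
print(winsorized)
print(ranked)
```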
**Normalization**
You need to check whether the labels (target) are normally distributed.
One way to get closer to that is to apply a log transform, so the plot looks more like a bell-shaped, Gaussian distribution.
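A minimal sketch of the log transform, assuming a hypothetical right-skewed target and scipy's skew as a rough normality check:

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Hypothetical right-skewed target, e.g. prices (values invented).
sale_price = pd.Series([50_000, 80_000, 120_000, 200_000, 450_000, 1_200_000])

# log1p compresses the long right tail so the distribution looks more bell-shaped.
log_price = np.log1p(sale_price)

print("skew before:", skew(sale_price))
print("skew after :", skew(log_price))
# Remember to invert with np.expm1 when turning predictions back into prices.
```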
Normality - When we talk about normality what we mean is that the data should look like a normal distribution. This is important because several statistical tests rely on it (e.g. t-statistics). In this exercise we'll just check univariate normality for 'SalePrice' (which is a limited approach). Remember that univariate normality doesn't ensure multivariate normality (which is what we would like to have), but it helps. Another detail to take into account is that in big samples (>200 observations) normality is not such an issue. However, if we solve normality, we avoid a lot of other problems (e.g. heteroscedasticity), so that's the main reason why we are doing this analysis.
Homoscedasticity - I just hope I wrote it right. Homoscedasticity refers to the 'assumption that dependent variable(s) exhibit equal levels of variance across the range of predictor variable(s)' (Hair et al., 2013). Homoscedasticity is desirable because we want the error term to be the same across all values of the independent variables.
Linearity- The most common way to assess linearity is to examine scatter plots and search for linear patterns. If patterns are not linear, it would be worthwhile to explore data transformations. However, we'll not get into this because most of the scatter plots we've seen appear to have linear relationships.
Absence of correlated errors - Correlated errors, like the definition suggests, happen when one error is correlated to another. For instance, if one positive error makes a negative error systematically, it means that there's a relationship between these variables. This occurs often in time series, where some patterns are time related. We'll also not get into this. However, if you detect something, try to add a variable that can explain the effect you're getting. That's the most common solution for correlated errors.
Pearson correlation
A strong correlation has a coefficient close to 1 or -1 and a p-value less than 0.001.
ANOVA should give a high F value and a small p-value.
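A sketch of both tests with scipy.stats on synthetic data (the values and group means are invented for illustration):

```python
import numpy as np
from scipy.stats import pearsonr, f_oneway

rng = np.random.default_rng(0)

# Hypothetical continuous feature/target pair with a linear relationship.
x = rng.normal(size=200)
y = 3 * x + rng.normal(scale=0.5, size=200)
r, p = pearsonr(x, y)
print(f"Pearson r={r:.3f}, p-value={p:.2e}")  # strong: |r| near 1, p < 0.001

# Hypothetical target values split by a 3-level categorical feature.
group_a = rng.normal(loc=0.0, size=50)
group_b = rng.normal(loc=1.0, size=50)
group_c = rng.normal(loc=2.0, size=50)
F, p = f_oneway(group_a, group_b, group_c)
print(f"ANOVA F={F:.1f}, p-value={p:.2e}")    # informative: high F, small p
```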
**Notes**
Tree-based models don't depend on feature scaling, while non-tree-based models usually do.
Second, we can treat scaling as an important hyperparameter in cases where the choice of scaling impacts prediction quality.
And lastly, we should remember that feature generation is powered by an understanding of the data.
First, an ordinal feature is a special case of a categorical feature, but with values sorted in some meaningful order.
Second, label encoding basically replaces the unique values of a categorical feature with numbers.
Third, frequency encoding maps unique values to their frequencies.
Fourth, label encoding and frequency encoding are often used for tree-based methods.
Fifth, one-hot encoding is often used for non-tree-based methods.
And finally, applying one-hot encoding to combinations of categorical features allows non-tree-based models to take interactions between features into consideration and improve.
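A small sketch of the three encodings in pandas, using an invented `city` column:

```python
import pandas as pd

# Hypothetical categorical feature.
df = pd.DataFrame({"city": ["Moscow", "Moscow", "Kiev", "Minsk", "Kiev", "Moscow"]})

# Label encoding: each unique value -> an integer (often fine for tree-based models).
df["city_label"] = df["city"].astype("category").cat.codes

# Frequency encoding: each unique value -> its relative frequency.
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# One-hot encoding: one binary column per value (usual choice for non-tree-based models).
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)

print(df)
```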
To summarize the most frequent methods used for feature generation from datetime and coordinates:
For datetime, these are exploiting periodicity, calculating the time passed since a particular event, and computing differences between two datetime features.
For coordinates, we should recall extracting interesting samples from train and test data, using places from additional data, calculating distances to centers of clusters, and adding aggregated statistics for the surrounding area.
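A minimal sketch of the datetime part, with an invented `date` column and an assumed reference event:

```python
import pandas as pd

# Hypothetical records with a datetime column and a reference event date.
df = pd.DataFrame({"date": pd.to_datetime(["2020-01-01", "2020-01-15", "2020-02-03"])})
last_promo = pd.Timestamp("2019-12-25")  # assumed "particular event"

# Periodicity: day of week, month, etc.
df["dayofweek"] = df["date"].dt.dayofweek
df["month"] = df["date"].dt.month

# Time passed since a particular event, in days.
df["days_since_promo"] = (df["date"] - last_promo).dt.days

# Difference between two datetime features (here: days since the previous row's date).
df["days_since_prev"] = df["date"].diff().dt.days

print(df)
```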
The choice of method to fill NaNs depends on the situation.
Sometimes you can reconstruct the missing values.
But usually it is easier to replace them with a value outside the feature range, like -999, or to replace them with the mean or median.
Also, missing values may already have been replaced with something by the organizers.
In that case, if you want to know the exact rows which have missing values, you can investigate this by browsing histograms.
Moreover, the model can improve its results using a binary isnull feature, which indicates which rows have missing values.
In general, avoid replacing missing values before feature generation, because it can decrease the usefulness of the features.
And in the end, XGBoost can handle NaNs directly, which sometimes can change the score for the better.
Using this knowledge, you should be able to identify missing values, describe the main methods to handle them, and apply this knowledge to gain an edge in your next competition.
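A short sketch of those options on an invented `income` column with NaNs:

```python
import numpy as np
import pandas as pd

# Hypothetical feature with missing values.
df = pd.DataFrame({"income": [50_000, np.nan, 62_000, np.nan, 48_000]})

# Binary "isnull" indicator often helps the model on its own.
df["income_isnull"] = df["income"].isnull().astype(int)

# Option 1: a value outside the feature range (common with tree-based models).
df["income_fill_999"] = df["income"].fillna(-999)

# Option 2: mean or median imputation.
df["income_fill_median"] = df["income"].fillna(df["income"].median())

print(df)
# XGBoost can also be fed the NaNs directly and will learn a default split direction.
```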
Check bag of words, n-grams, and word2vec for text features.
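For example, a minimal bag-of-words/n-gram sketch with scikit-learn's CountVectorizer on made-up texts (word2vec would instead come from a library such as gensim):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical text feature.
texts = ["cheap flights to London", "cheap cheap hotels", "flights and hotels"]

# Bag of words with unigrams and bigrams: each column counts one term per document.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```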
**Removing constant features**
Constant features are the type of features that contain only one value for all the outputs in the dataset. Constant features provide no information that can help in classification of the record at hand. Therefore, it is advisable to remove all the constant features from the dataset.
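A minimal sketch of dropping such columns with pandas, using an invented DataFrame:

```python
import pandas as pd

# Hypothetical DataFrame where "country" never varies.
df = pd.DataFrame({
    "country": ["US", "US", "US", "US"],
    "age": [23, 35, 41, 29],
})

# Drop every column with a single unique value (NaN counted as a value).
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
df = df.drop(columns=constant_cols)

print(constant_cols, df.columns.tolist())
```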
Data should be shuffled, because sorted data might not train well and can introduce bias into the train/validation split.
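For instance, a tiny sketch with scikit-learn's shuffle utility on a deliberately sorted toy DataFrame:

```python
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical dataset sorted by target: without shuffling, a sequential
# train/validation split would see very different label distributions.
df = pd.DataFrame({"x": range(10), "y": [0] * 5 + [1] * 5})

df_shuffled = shuffle(df, random_state=42).reset_index(drop=True)
print(df_shuffled["y"].tolist())
```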
Are there any real outliers in the dataset, or are there just, let's say, unexpectedly high values that we should treat like the others?
Outliers usually come from mistakes, measurement errors, and so on, but at the same time, similar-looking objects can be of a natural kind.
So, if you think these unusual objects are normal in the sense that they are just rare, you should not use a metric that will ignore them, and it is better to use MSE.
Otherwise, if you think they are really outliers, like mistakes, you should use MAE.
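A small sketch of why: on invented numbers with one extreme point, MSE blows up while MAE barely moves:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 12.0, 11.0, 13.0, 100.0])   # last value: unusual point
y_pred = np.array([10.5, 11.5, 11.0, 12.5, 14.0])    # model ignores the extreme

# MSE squares the residuals, so the single extreme point dominates the score;
# MAE grows only linearly, so it is much less sensitive to that point.
print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
```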
Mean encoding can be used for categorical features, but regularization is a necessity.
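A sketch of one common regularization, out-of-fold mean encoding with KFold, on an invented training frame:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical training data with a categorical feature and a binary target.
train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C", "C", "A", "B", "C"],
    "target": [1, 0, 1, 1, 0, 0, 0, 1, 1, 0],
})
global_mean = train["target"].mean()

# Out-of-fold mean encoding: each row is encoded with target means computed
# on the other folds only, which regularizes against target leakage.
train["city_mean_enc"] = np.nan
for fit_idx, enc_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(train):
    fold_means = train.iloc[fit_idx].groupby("city")["target"].mean()
    train.loc[train.index[enc_idx], "city_mean_enc"] = (
        train.iloc[enc_idx]["city"].map(fold_means).values
    )
# Categories unseen in a fold fall back to the global mean.
train["city_mean_enc"] = train["city_mean_enc"].fillna(global_mean)

print(train)
```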