Started watching Week 2 of Andrew Ng's Coursera Machine Learning course.
Beginning with Model & Cost function. For model representation, the lecture showed the correlation of house size to sale price in Portland. In this example, the house size is the x-value input, and the y-value is the sale price. It is a supervised learning scenario, since we train the model on known historical sale prices.
This is an example of univariate linear regression, or linear regression with one variable. Because we're trying to predict continuous real values, we call this a "regression problem". If the output instead takes only a small set of discrete values, it is a "classification problem".
To validate this scenario for the financial projections case, I'll use the google_vix_results.csv data below, which attempts to predict the high of the GOOG equity from the prior day's GOOGVIX volatility index high. The prior_day_vix_futures_high will be the x and the observed_equity_high will be the y. The number of training examples, m, is 1752.
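As a rough sketch of how the x and y values could be pulled out of the file, something like the following might work. It assumes google_vix_results.csv is comma-delimited with a header row that uses the two column names mentioned above; the load_xy helper and the three sample rows are made up for illustration and are not values from the real data.

```python
import csv
import io


def load_xy(csv_file,
            x_col="prior_day_vix_futures_high",
            y_col="observed_equity_high"):
    """Read the x and y columns from an open CSV file into two lists of floats.

    Assumes a header row containing x_col and y_col (an assumption about
    the file layout, not something confirmed by the data itself).
    """
    reader = csv.DictReader(csv_file)
    xs, ys = [], []
    for row in reader:
        xs.append(float(row[x_col]))
        ys.append(float(row[y_col]))
    return xs, ys


# Placeholder rows for illustration only; NOT taken from google_vix_results.csv.
sample = io.StringIO(
    "prior_day_vix_futures_high,observed_equity_high\n"
    "20.0,500.0\n"
    "25.0,490.0\n"
    "30.0,480.0\n"
)

xs, ys = load_xy(sample)
m = len(xs)  # number of training examples
```

With the real file, the same helper could be called as `with open("google_vix_results.csv", newline="") as f: xs, ys = load_xy(f)`, and m should then match the row count given above.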
We want to understand the parameters of the hypothesis for the linear regression. The standard form would be something like:
h(x) = theta(0) + theta(1)*x
Where theta(0) is the y-intercept (the value of h when x is zero), theta(1) is the slope, and x is the input value. As a simplification to start with, we let theta(0) be zero, so h(x) = theta(1)*x. We want to choose theta(1) so that h(x) - y is small across the training examples; usually we minimize the sum of squared differences, SUM i=1..m((h(x(i)) - y(i)) ^ 2), where (x(i), y(i)) is the i-th training example.
The cost function is written as:
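J(theta(0), theta(1)) = (1/(2m)) * SUM i=1..m((h(x(i)) - y(i)) ^ 2)
Here h(x(i)) is the hypothesis's prediction for the i-th training example, y(i) is the observed value, and m is the number of training examples.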
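To make the cost function concrete, here is a minimal sketch in plain Python. The function names hypothesis and cost are my own, and the three data points are a toy set lying exactly on y = x, not the GOOG data. With theta(0) fixed at zero, the cost should be zero at theta(1) = 1 and grow as theta(1) moves away from 1.

```python
def hypothesis(theta0, theta1, x):
    """Linear hypothesis: predicted y for input x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared-error cost J: (1 / (2m)) * sum of (prediction - y)^2."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2
                for x, y in zip(xs, ys))
    return total / (2 * m)


# Toy training set (illustrative only): points lying exactly on y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

# Keep theta(0) at zero and try a few slopes.
j_at_1 = cost(0.0, 1.0, xs, ys)     # perfect fit, so the cost is 0
j_at_half = cost(0.0, 0.5, xs, ys)  # (0.25 + 1 + 2.25) / 6, about 0.58
j_at_0 = cost(0.0, 0.0, xs, ys)     # (1 + 4 + 9) / 6, about 2.33

for theta1, j in [(1.0, j_at_1), (0.5, j_at_half), (0.0, j_at_0)]:
    print(f"theta1={theta1}: J={j:.4f}")
```

The slope that gives the smallest J is the one that fits the training examples best, which is what minimizing the cost function is meant to find.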