COMP9417 - Machine Learning: Assignment 1

Chi-Chun Tiao: 3390558


    • a. Observing the results of the comparison by Percent_correct, using DecisionStump as the baseline, we can see that there is no statistically significant difference in accuracy between J48 and DecisionStump on the datasets credit-rating (85.51% vs. 85.57%) and vote (95.63% vs. 96.57%).

      First note that J48 is simply an implementation of the C4.5 algorithm, which generates a decision tree of arbitrary depth, while DecisionStump generates a decision tree of depth one, i.e. a set of rules that test a single attribute. Thus, one possible reason their accuracies might be approximately equivalent is that the inductive bias of decision tree learning (Occam's razor/preference for shorter trees) may not suit the dataset well, causing both algorithms to perform equally poorly - not the case here with either dataset (both over 85% accuracy)! Another possible cause is the existence of a single attribute in the dataset that yields high information gain. Clearly, this will be the only attribute tested in the decision stump and the dominant attribute in the decision tree, making it the biggest factor in the classification of instances.

      Upon examination of the structure of the data, we see that this is indeed the case for both datasets:

      • credit-rating: The information gain yielded by the attribute A9 is 0.425709 - about twice that of any other attribute.
      • vote: The information gain yielded by the attribute physician-fee-freeze is 0.7078541 - again, almost twice that of any other attribute.

      As expected, these attributes sit at the root node of their respective decision trees and stumps. Now, we also expect the classifications at the leaves of the decision tree to coincide with the classifications of the decision stump, i.e. the majority classification on one side of the decision tree should favour the classification of the corresponding side of the decision stump, and we see that this is indeed true as well:

      • credit-rating:
        • Decision stump - A9 = t : +, A9 = f : -
        • Decision tree - A9 = t: [11+, 8-], A9 = f: [4+, 7-]
      • vote:
        • Decision stump - physician-fee-freeze = n : democrat, physician-fee-freeze = y : republican
        • Decision tree - physician-fee-freeze = n : [1 democrat, 0 republican], physician-fee-freeze = y : [2 democrat, 3 republican]

      Hence, for these datasets, the decision tree and stump learned by J48 and DecisionStump are approximately the same (barring the classification of more specific examples), and they inevitably classify instances in almost the same way, so their accuracies show no statistically significant difference.
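
      To make this concrete, the information gain figures quoted above can be reproduced by hand. Below is a minimal sketch of the calculation, assuming the dataset has been exported from its ARFF file to a CSV with a nominal "class" column (the file and column names are assumptions):

    ```python
    import numpy as np
    import pandas as pd

    def entropy(labels):
        """Shannon entropy of a class distribution."""
        p = labels.value_counts(normalize=True)
        return -np.sum(p * np.log2(p))

    def information_gain(df, attribute, target="class"):
        """Entropy of the target minus the weighted entropy after splitting on `attribute`."""
        split = sum(
            (len(subset) / len(df)) * entropy(subset[target])
            for _, subset in df.groupby(attribute)
        )
        return entropy(df[target]) - split

    # Hypothetical usage: rank the credit-rating attributes by information gain.
    df = pd.read_csv("credit-rating.csv")
    gains = {a: information_gain(df, a) for a in df.columns if a != "class"}
    print(sorted(gains.items(), key=lambda kv: -kv[1]))
    ```

      A single attribute dominating this ranking is exactly the situation in which the stump and the full tree end up agreeing on almost every instance.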

    • b. Over-fitting...

    • c. First of all, note that there are 20,000 instances in the dataset - at least 20 times as many as in the other datasets under consideration. The number of instances alone will drastically affect the training time, but furthermore this dataset contains a total of 17 continuous-valued attributes and 26 discrete output values. This means that for each continuous-valued attribute, at every node, the learner has to search for the threshold-based boolean attribute that produces the greatest information gain. Additionally, since there are many output values, there are potentially a large number of leaf nodes (there are in fact 1226 leaf nodes).
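
      For reference, a rough sketch of the threshold search C4.5 performs for each continuous-valued attribute is shown below (it assumes NumPy arrays x of attribute values and y of class labels); repeating this search at every node, for every continuous attribute, is a large part of why training on this dataset is so slow:

    ```python
    import numpy as np

    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def best_threshold(x, y):
        """C4.5-style search: try the midpoint between each pair of adjacent sorted
        values and keep the binary split with the highest information gain."""
        order = np.argsort(x)
        x, y = x[order], y[order]
        base = entropy(y)
        best_gain, best_t = -1.0, None
        for i in range(1, len(x)):
            if x[i] == x[i - 1]:
                continue                  # no boundary between equal values
            t = (x[i] + x[i - 1]) / 2.0
            gain = base - (i * entropy(y[:i]) + (len(y) - i) * entropy(y[i:])) / len(y)
            if gain > best_gain:
                best_gain, best_t = gain, t
        return best_t, best_gain
    ```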

  1. Note that the datasets respond differently to changes in the minimum number of instances per leaf: the decision tree for Protein_response_to_perox appears to become less accurate as we increase the minimum number of instances per leaf, while that of Protein_response_to_sorbi appears to become more accurate. By increasing the minimum number of instances per leaf, we are in effect reducing the number of leaf nodes in the tree and hence generalising. So in the case of Protein_response_to_perox, we can say that the inductive bias of our tree learner may not be well suited to the dataset: shorter (more general) trees do not necessarily perform better.

    In the case of Protein_response_to_sorbi, on the other hand, we may actually have over-fitted the dataset while our tree was too specific: as we increase the minimum number of instances per leaf, thereby decreasing the size of the tree and generalising, the accuracy increases. In fact, the accuracy continues to increase until the minimum number of instances per leaf is so large that the decision tree collapses into a decision stump (as reflected by measureTreeSize). As implicitly stated earlier, the tree size decreases as the minimum number of instances per leaf increases, and this is indeed the case with both of our datasets - it follows from the fact that we are in effect "pruning" leaf nodes that are not supported by the minimum number of instances. A rough sketch of this experiment follows.
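
    Below is a rough analogue of the experiment using scikit-learn's DecisionTreeClassifier, where min_samples_leaf plays the role of J48's minimum number of instances per leaf; the CSV file names and the "class" column are assumptions about how the ARFF data might be exported:

    ```python
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    for name in ["Protein_response_to_perox.csv", "Protein_response_to_sorbi.csv"]:
        df = pd.read_csv(name)
        X = pd.get_dummies(df.drop(columns="class"))   # one-hot encode nominal attributes
        y = df["class"]
        for min_leaf in [2, 5, 10, 20, 50]:
            tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
            acc = cross_val_score(tree, X, y, cv=10).mean()
            n_leaves = tree.fit(X, y).get_n_leaves()
            print(f"{name}: min_samples_leaf={min_leaf}, accuracy={acc:.3f}, leaves={n_leaves}")
    ```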

  2. Let us inspect the leaf nodes that yield the lowest and highest numerically predicted values of the (owner-occupied) homes respectively, namely,

    LM num: 15
    MEDV = 
    	+ 12.002
    

    and

    LM num: 6
    MEDV = 
    	+ 39.3832
    

    to see if we can shed light on any potentially interesting and meaningful patterns.

    By inspecting the branch of the tree that leads us to LM15,

    LSTAT >  9.725 : 
    |   LSTAT >  15 : 
    |   |   CRIM >  5.769 : 
    |   |   |   LSTAT >  19.73 : 
    |   |   |   |   NOX >  0.675 : LM15 (34/23.674%)
    

    we see that the LSTAT attribute, which denotes the percentage of the population of lower status, appears numerous times, indicating that more than approximately 20% of the population are of the lowest status (LSTAT > 19.73) - not exactly a thriving neighbourhood. Furthermore, the crime rate (per capita) of the area is above 5.7 and the nitric oxides concentration (parts per 10 million) is above 0.6, both of which are the highest thresholds amongst the decision nodes in the tree. Hence we can loosely interpret from this branch that our Regression Tree predicts the value of houses in poorer areas (over 20% of the population of the lowest status) with high crime rates and nitric oxides concentrations to be comparatively low (~$12,000). This is a rather intuitive result, and in fact we can see that even houses in areas where over 15% of the population are of the lowest status will be predicted to have relatively low value (<$18,000).
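
    As a quick sanity check on this reading of the branch, one could filter the raw data by the same tests and look at the mean MEDV of the matching instances. A sketch, assuming the housing data is available as a CSV with the standard column names (the file name is hypothetical):

    ```python
    import pandas as pd

    housing = pd.read_csv("housing.csv")

    # Rows routed down the branch leading to LM15; the LSTAT > 9.725 and
    # LSTAT > 15 tests are subsumed by LSTAT > 19.73.
    lm15 = housing[(housing.LSTAT > 19.73) & (housing.CRIM > 5.769) & (housing.NOX > 0.675)]
    print(len(lm15), "instances, mean MEDV:", lm15.MEDV.mean())
    ```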

    Now let us try to uncover some meaningful patterns in housing areas that are predicted to have high value by following the branch:

    LSTAT <= 9.725 :
    |   RM >  6.941 : 
    |   |   RM >  7.437 : LM6 (30/65.87%)
    

    Firstly, we see that the lower-status group makes up less than 10% of the population, so the neighbourhood is not exactly poor. Next, we see that the average number of rooms per dwelling is over 7.4 - rather large houses. Again, rather intuitively, this Regression Tree predicts rather "large" houses in "richer" areas to have a comparatively high value (~$39,000). In fact, any such area (where the lower-status group makes up less than 10% of the population) with over 6.9 rooms per dwelling will be predicted to have a value over around $33,000. Again, such a pattern fits our intuition.

    Another pattern worth noting is the value of LM1 compared to those of LM2 and LM3. This hints that houses in "richer" areas with fewer than 6.9 rooms per dwelling that are closer to employment centres (within 3.3 weighted units) will be predicted to have a higher value (~$3,000-4,000 higher) than those that are not as close.

    This is a rather important and meaningful pattern that is not entirely intuitive: one would not generally expect proximity to employment centres to be a key factor in housing prices, and it is not entirely clear why it is. One interpretation could be that employment rates, and hence the socioeconomic status of the area's population, would be higher (though one might expect poorer areas to require more employment centres). Perhaps this is simply another, finer and more subtle, way of expressing the status of the population than the LSTAT attribute.
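
    For anyone wanting to reproduce this kind of rule inspection outside Weka, here is a minimal sketch with scikit-learn. It fits a plain regression tree (constant leaves) rather than an M5 model tree, so the leaves hold averages instead of linear models, but the splits on LSTAT, RM, CRIM and so on can be compared directly; the CSV file name is again an assumption:

    ```python
    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor, export_text

    housing = pd.read_csv("housing.csv")
    X, y = housing.drop(columns="MEDV"), housing["MEDV"]

    # Shallow tree with a minimum leaf size, to keep the printed rules readable.
    tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=20, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))
    ```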

    • a. By increasing the number of iterations of gradient descent to 2000 and the number of hidden layers in the network to 20, we increased the accuracy to 80.3%. However, this increase - and the accuracy measure itself - is quite deceptive, as we are only validating against the training data. Furthermore, all we have done to "increase" the accuracy is increase the number of weight-tuning iterations and the number of hidden layers in the network - effectively creating an overly complex decision surface that fits spurious and noisy training examples and is not at all representative of the general distribution of unseen examples.

      Here, not only are we solely considering the accuracy on training examples, we are also decreasing the generalisation accuracy over unseen examples - a disastrous recipe for over-fitting. Indeed, we see that, testing again with 10-fold cross-validation, the accuracy has decreased by as much as ~30%, to just 51%.
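
      The gap between the two figures can be made explicit by scoring the same over-parameterised network on its own training data and under 10-fold cross-validation. A sketch using scikit-learn's MLPClassifier as a stand-in for Weka's MultilayerPerceptron (the file name and "class" column are assumptions):

    ```python
    import pandas as pd
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("dataset.csv")        # hypothetical export of the ARFF data
    X, y = pd.get_dummies(df.drop(columns="class")), df["class"]

    # Deliberately over-parameterised network, as in the experiment above.
    net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)

    print("training accuracy:   ", net.fit(X, y).score(X, y))
    print("10-fold CV accuracy: ", cross_val_score(net, X, y, cv=10).mean())
    ```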

      One way to address this issue, and thereby improve the accuracy of neural network learning without changing the implementation of the algorithm, is to choose the number of weight-tuning iterations with respect to a validation set. That is, we use a training set as usual to perform the gradient descent search, and keep the number of iterations that produces the lowest error over the validation set. This ensures that once we reach the lowest error with respect to the validation set, we cease to tune the weights any further, avoiding over-fitting. This approach works well for our dataset because we have 1000 instances. In the case where data is scarce, we can turn to k-fold cross-validation instead, running the procedure k times to determine the number of iterations that yields the best performance on each validation fold, and then setting the number of weight-tuning iterations to the average of these numbers.
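
      The same idea is available off the shelf in, for example, scikit-learn's MLPClassifier, whose early_stopping option holds out a validation fraction of the training data and stops the weight-tuning iterations once the validation score stops improving; a sketch under the same assumptions as above:

    ```python
    import pandas as pd
    from sklearn.neural_network import MLPClassifier

    df = pd.read_csv("dataset.csv")        # same hypothetical export as above
    X, y = pd.get_dummies(df.drop(columns="class")), df["class"]

    net = MLPClassifier(
        hidden_layer_sizes=(20,),
        max_iter=2000,
        early_stopping=True,       # monitor a held-out validation split
        validation_fraction=0.1,   # 10% of the training data
        n_iter_no_change=10,       # patience before stopping
        random_state=0,
    ).fit(X, y)
    print("stopped after", net.n_iter_, "iterations")
    ```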

    • b. Upon examining the results produced by Apriori with the specified parameters, we see a few resulting rules that may be interesting as they apply to more than 40% (400 instances) of the dataset, i.e. these rules have high support. It turns out that these are the single-consequent, single-antecedent rules, and indeed by increasing the lowerBoundMinSupport parameter to 0.4 we see that only these rules are returned. By mining for rules with especially high support and confidence, we find that these 4 resulting rules embody the interesting associations in the dataset.
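
      The same high-support, high-confidence mining can be reproduced outside Weka, for example with mlxtend's Apriori implementation; a sketch assuming the data has been exported to CSV and one-hot encoded:

    ```python
    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    df = pd.read_csv("dataset.csv")              # hypothetical export of the ARFF data
    onehot = pd.get_dummies(df).astype(bool)     # Apriori expects boolean item columns

    # Itemsets covering at least 40% of the instances, then rules with confidence >= 0.9.
    frequent = apriori(onehot, min_support=0.4, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.9)
    print(rules[["antecedents", "consequents", "support", "confidence"]])
    ```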

    • a. From the results, we see that the relative absolute error of Linear Regression is higher than that of the other schemes on most datasets, pdgfr2 being the exception. This may be the result of the linearity constraint on the hypotheses it produces. That is, if a dataset exhibits a nonlinear dependency, the best-fitting straight line, even though it minimises the mean-squared error, simply does not suffice and performs poorly as a result. Note, however, that in cases where the target concept is in fact linear while the training examples contain noise that suggests otherwise, Linear Regression will perform better than multilayer perceptrons and model trees, as they are likely to produce overly complex decision surfaces that fit the idiosyncrasies of noisy instances at the cost of generalisation accuracy. This is very likely to be the case with pdgfr2, where the relative absolute error for multilayer perceptron learning is 2.420671494481303E152!
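
      This comparison can be checked directly on any of the datasets by cross-validating a linear model against the more flexible learners. A sketch with scikit-learn stand-ins (the file and target column names are assumptions, and mean absolute error is used in place of Weka's relative absolute error):

    ```python
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.neural_network import MLPRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    df = pd.read_csv("pdgfr2.csv")                   # hypothetical export of the ARFF data
    X, y = df.drop(columns="target"), df["target"]   # "target" column name is an assumption

    models = {
        "linear regression": LinearRegression(),
        "multilayer perceptron": MLPRegressor(max_iter=2000, random_state=0),
        "regression tree": DecisionTreeRegressor(min_samples_leaf=5, random_state=0),
    }
    for name, model in models.items():
        mae = -cross_val_score(model, X, y, cv=10, scoring="neg_mean_absolute_error").mean()
        print(f"{name}: mean absolute error = {mae:.3f}")
    ```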

      Note that in all other cases...

    • b. I rove u rong time
