Friday, July 8, 2011

Statistical Analysis: Cross Validation

Cross validation is a technique used to estimate how well a model will generalize in practice.

In this example, we will use 5-fold cross validation: the data is partitioned into five sections (folds).
In each round, one fold is used as the testing set and the remaining folds are used as the training set. The role of testing set rotates through the five folds, so each fold serves as the test set exactly once.

After evaluating each of the 5 models, we average the results.
(Results rounded to nearest hundredth)
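Before the per-model results, here is a rough sketch of the 5-fold procedure in Python with scikit-learn. The tooling and dataset are assumptions on my part: the random data below simply stands in for the real 57-sample dataset used for the results.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Stand-in data: 57 samples, 3 classes (the real dataset is not shown here)
X, y = make_classification(n_samples=57, n_features=6, n_informative=3,
                           n_classes=3, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])      # train on the other four folds
    preds = model.predict(X[test_idx])         # evaluate on the held-out fold
    fold_scores.append(accuracy_score(y[test_idx], preds))

print("Per-fold accuracy:", [round(s, 2) for s in fold_scores])
print("Mean accuracy:", round(float(np.mean(fold_scores)), 2))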

Results for Decision Tree model:
Confusion matrix: [table not preserved]
Class partitioning and performance evaluation: [table not preserved]
Overall Accuracy:
0.5018 (28.6 / 57)
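As an aside, the kind of pooled confusion matrix and per-class breakdown shown above can be produced from the cross-validated predictions. A minimal sketch, again on stand-in data rather than the original dataset:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report

# Stand-in data (the original dataset is not reproduced in this post)
X, y = make_classification(n_samples=57, n_features=6, n_informative=3,
                           n_classes=3, random_state=0)

# Each sample is predicted exactly once, by the model whose test fold it fell into
preds = cross_val_predict(DecisionTreeClassifier(random_state=0), X, y, cv=5)

print(confusion_matrix(y, preds))                 # rows: true class, columns: predicted class
print(classification_report(y, preds, digits=2))  # per-class precision, recall, f1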

Results for Nearest Neighbor (k=1) model:
Confusion matrix: [table not preserved]
Class partitioning and performance evaluation: [table not preserved]
Overall Accuracy:
0.7544 (43 / 57)

Results Analysis:
The nearest neighbor model appears to perform better than the decision tree model.
This might be because the decision tree must split the data in a binary fashion at each non-leaf node, so it is limited to fixed, absolute splits such as "is x greater than 5?" or "is the color purple?". Roughly speaking, this makes its evaluation "hit-or-miss". The nearest neighbor model, by contrast, predicts a class from a point's position relative to its neighbors rather than from any absolute splitting expression, which allows it to be more forgiving of outlying data.
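To make the comparison concrete, here is a sketch that scores both models with the same 5-fold split. This is an illustration on stand-in data, not a reproduction of the numbers above:

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; the accuracies reported above come from the original dataset
X, y = make_classification(n_samples=57, n_features=6, n_informative=3,
                           n_classes=3, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

for name, model in [("decision tree", DecisionTreeClassifier(random_state=0)),
                    ("1-nearest neighbor", KNeighborsClassifier(n_neighbors=1))]:
    scores = cross_val_score(model, X, y, cv=cv)   # one accuracy per fold
    print(name, "mean accuracy:", round(scores.mean(), 2))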
