Thursday, July 7, 2011

Understanding different classifiers: Decision Trees

Tony suggested that I look into another classifier besides the nearest-neighbor classifier, not in the hope that it will perform any better, but for the sake of my own learning about the different types of classifiers.


Initial Setup:
There are 389 samples across all species in the data set.
Each image starts at 4080x2720 resolution and is scaled down to 448x299.
The samples in each class are split roughly in half to form the Training Set and the Validation Set.
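
The post doesn't include the splitting code, so here is a rough MATLAB sketch of how a per-class half split could be done; the variable names features and labels are placeholders of mine, not names from the actual pipeline:

% Split each class roughly in half into training and validation sets.
% 'features' is assumed to be a 389-by-D feature matrix and 'labels' a
% 389-by-1 vector of class IDs 0..10.
rng(0);                                     % fix the seed so the split is repeatable
trainIdx = [];  valIdx = [];
for c = 0:10
    idx = find(labels == c);                % samples belonging to class c
    idx = idx(randperm(numel(idx)));        % shuffle within the class
    half = floor(numel(idx)/2);
    valIdx   = [valIdx;   idx(1:half)];     % first half -> validation
    trainIdx = [trainIdx; idx(half+1:end)]; % remainder -> training
end
Xtrain = features(trainIdx, :);  ytrain = labels(trainIdx);
Xval   = features(valIdx, :);    yval   = labels(valIdx);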


Here are the sample counts per class:

Class:    Validation Set:    Training Set:
0:        10                 9
1:        30                 31
2:        24                 23
3:        22                 22
4:        9                  10
5:        12                 12
6:        11                 10
7:        17                 17
8:        18                 18
9:        23                 24
10:       19                 18
Total:    195                194


The decision tree was set up to use 10-fold cross-validation, surrogate splits, and pruning.
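
The training code itself isn't in the post; a minimal sketch of this setup with the MATLAB Statistics Toolbox classregtree interface (reusing the assumed Xtrain/ytrain/Xval/yval variables from the split sketch above) might look roughly like:

% Grow a classification tree with surrogate splits, choose the pruning level
% by 10-fold cross-validation, prune, and predict the validation labels.
t = classregtree(Xtrain, ytrain, 'method', 'classification', 'surrogate', 'on');
[cvCost, seCost, nNodes, bestLevel] = test(t, 'crossvalidate', Xtrain, ytrain);  % 10-fold CV is the default
tPruned  = prune(t, 'level', bestLevel);        % prune back to the best cross-validated level
yhatTree = str2double(eval(tPruned, Xval));     % eval returns predicted class names as strings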

Results and Comparison:
Confusion matrices on the validation set (rows: true class, columns: predicted class):

Decision Tree                                       k-Nearest Neighbor (k = 6)

[ 7,  0,  0,  0,  0,  1,  0,  0,  2,  0,  0;      [ 9,  0,  1,  0,  0,  0,  0,  0,  0,  0,  0;
  0, 21,  3,  0,  0,  0,  3,  1,  0,  2,  0;        0, 23,  0,  1,  0,  0,  0,  1,  2,  3,  0;
  0,  0, 11,  0,  3,  1,  6,  0,  1,  1,  1;        0,  0, 15,  0,  0,  0,  0,  0,  5,  0,  4;
  0,  4,  2, 10,  0,  0,  0,  2,  1,  3,  0;        0,  4,  0, 13,  0,  0,  0,  5,  0,  0,  0;
  0,  1,  2,  0,  4,  0,  1,  0,  0,  1,  0;        0,  5,  3,  0,  1,  0,  0,  0,  0,  0,  0;
  0,  1,  2,  0,  2,  6,  1,  0,  0,  0,  0;        0,  1,  3,  0,  0,  6,  0,  0,  2,  0,  0;
  0,  1,  3,  5,  0,  0,  1,  0,  0,  1,  0;        0,  5,  0,  2,  0,  0,  0,  0,  2,  2,  0;
  0,  0,  1,  4,  0,  0,  0, 12,  0,  0,  0;        0,  0,  0,  0,  0,  0,  0, 17,  0,  0,  0;
  0,  1,  8,  0,  0,  0,  0,  0,  8,  0,  1;        0,  0,  2,  0,  0,  0,  0,  0, 16,  0,  0;
  0,  0,  0,  3,  0,  0,  0,  0,  0, 20,  0;        0,  5,  0,  1,  0,  0,  0,  0,  1, 16,  0;
  0,  0,  5,  0,  0,  2,  0,  0,  0,  0, 12]        0,  0,  2,  0,  0,  0,  0,  0,  1,  0, 16]

Total accuracy on validation set:
D-Tree: 0.574359 (112/195)
k-NN:    0.676923 (132/195)
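
For reference, the totals above can be recomputed from the predictions and true labels; a sketch using the assumed variables from the snippets above:

% confusionmat puts true classes on the rows and predicted classes on the
% columns, matching the matrices shown earlier.
C = confusionmat(yval, yhatTree);
overallAcc = sum(diag(C)) / sum(C(:));          % 112/195 = 0.574 for the pruned tree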


Individual class accuracy:
Class    D-Tree    k-NN
0:       0.700     0.900
1:       0.700     0.767
2:       0.458     0.625
3:       0.455     0.591
4:       0.444     0.111
5:       0.500     0.500
6:       0.091     0.000
7:       0.706     1.000
8:       0.444     0.889
9:       0.870     0.696
10:      0.632     0.842
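
These per-class numbers are just the confusion-matrix diagonal divided by the number of validation samples in each class (the row sums); continuing the sketch above:

classAcc = diag(C) ./ sum(C, 2);                % e.g. 7/10 = 0.700 for class 0 of the d-tree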

Results Interpreted:
First, we notice that the k-NN classifier outperforms the d-tree classifier by about 10 percentage points overall (67.7% vs. 57.4%), but the d-tree's per-class accuracies vary less from class to class (smaller standard deviation) than the k-NN's.

Class 6 has the highest error rate under both classifiers; my speculation is that its shades of color are similar to those of several other classes.

One reason the k-NN classifier might be performing better than the d-tree is that it classifies each sample by a distance-based vote among its nearest neighbors, so the decision adapts to where the sample falls relative to the data, whereas the d-tree commits to fixed, locally optimal threshold boundaries at each node of the binary tree.
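
To make that contrast concrete, here is a toy MATLAB sketch of the k-NN voting mechanism with k = 6; it illustrates the idea and is not the actual classifier code behind the results above:

% Each validation sample takes the majority label of its 6 nearest training
% samples, so the decision depends on relative distances in feature space
% rather than on a fixed per-feature threshold like a tree split.
k = 6;
D = pdist2(Xval, Xtrain);                       % pairwise Euclidean distances
[~, nearest] = sort(D, 2);                      % training samples ordered by distance
yhatKnn = mode(ytrain(nearest(:, 1:k)), 2);     % majority vote (ties go to the smaller label)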

I also swapped the training and validation sets and re-ran the experiment; the results were roughly the same, which suggests that neither half of the split is biased with respect to the features.
