(This post responds to an assignment for MAS.S62 Interactive Machine Learning at the MIT Media Lab to analyze the input and output channels of a machine learning algorithm for their potential as affordances for interaction.)
Examined for interaction affordances, the Random Decision Forest (Breiman 2001) distinguishes itself from other machine learning algorithms in its potential for transparency. Due to the nature of the algorithm, most Random Decision Forest implementations provide an extraordinary amount of information about the final state of the classifier and how it was derived from the training data.
In this analysis, I discuss five outputs that are available from a Random Decision Forest and ways they could be used to provide interface or visualization options for a lay user of such a classifier. I also describe one input that could be similarly useful.
(For each output and input, I provide a link to the corresponding function in the OpenCV Random Decision Forest implementation. Other implementations should also provide similar access.)
Output: variable importance
In addition to returning the classification result, most Random Decision Forest implementations can also report how much each variable in the feature vector contributed to the result. These importance scores are calculated by adding noise to each variable one at a time (typically by randomly permuting its values across samples) and measuring the corresponding increase in the misclassification rate.
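The scoring procedure can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not OpenCV's code: the noise is added by shuffling one feature column at a time, the error is measured on whatever samples are passed in (Breiman's formulation uses each tree's out-of-bag samples), and `toy_predict` is a hypothetical stand-in for a trained forest.

```python
import random

def misclassification_rate(predict, rows, labels):
    """Fraction of rows whose predicted label differs from the true label."""
    wrong = sum(1 for row, label in zip(rows, labels) if predict(row) != label)
    return wrong / len(rows)

def permutation_importance(predict, rows, labels, seed=0):
    """Increase in error rate after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = misclassification_rate(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        # Shuffle column j across samples, leaving every other column intact.
        column = [row[j] for row in rows]
        rng.shuffle(column)
        perturbed = [row[:j] + [value] + row[j + 1:]
                     for row, value in zip(rows, column)]
        importances.append(
            misclassification_rate(predict, perturbed, labels) - baseline)
    return importances

# Hypothetical stand-in for a trained forest: it only looks at feature 0.
def toy_predict(row):
    return 1 if row[0] > 0.5 else 0

rows = [[i / 10, (i * 7 % 10) / 10] for i in range(10)]
labels = [toy_predict(row) for row in rows]
importances = permutation_importance(toy_predict, rows, labels)
```

Because the stand-in classifier ignores feature 1, shuffling that column cannot change any prediction, so its importance is exactly zero; only feature 0 can receive a positive score.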
Presenting this data to the user in the form of a table, ranked list, or textual description could aid in feature selection and also help improve user understanding of the underlying data.
OpenCV’s implementation: CvRTrees::getVarImportance()
Output: proximity between any two samples
A trained Random Decision Forest can calculate the proximity between any two given samples in the training set. Proximity is calculated by comparing the number of trees where the two samples ended up in the same leaf node to the total number of trees in the ensemble.
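As a minimal sketch of that ratio, assume we already know which leaf each sample reaches in every tree. The per-tree leaf ids below are made up for illustration, and the function names are my own rather than part of any library.

```python
def proximity(leaves_a, leaves_b):
    """Fraction of trees in which two samples land in the same leaf.

    leaves_a[t] is the id of the leaf that sample A reaches in tree t.
    """
    if len(leaves_a) != len(leaves_b):
        raise ValueError("both samples need one leaf id per tree")
    shared = sum(1 for a, b in zip(leaves_a, leaves_b) if a == b)
    return shared / len(leaves_a)

def proximity_matrix(leaf_ids):
    """Pairwise proximities for a list of samples (one leaf-id list each)."""
    return [[proximity(a, b) for b in leaf_ids] for a in leaf_ids]

# Hypothetical leaf ids for three samples in a four-tree forest.
leaf_ids = [
    [3, 1, 4, 2],  # sample 0
    [3, 1, 0, 2],  # sample 1: shares a leaf with sample 0 in 3 of 4 trees
    [5, 6, 7, 8],  # sample 2: never shares a leaf with the others
]
matrix = proximity_matrix(leaf_ids)
```

A matrix like this could feed a layout or clustering step to produce a navigable view of the training samples.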
This proximity data could be presented to the user of an interactive machine learning system both to improve the user’s understanding of the current state of training and to suggest additional labeled samples that would significantly improve classification. By calculating the proximity of each pair of samples in the training set (or a large subset of these), a system could produce a navigable visualization of the existing training samples that could significantly aid the user in identifying mislabeled samples, crafting useful additional samples, and understanding the causes of the system’s predictions.
OpenCV’s implementation: CvRTrees::get_proximity()
Output: prediction confidence
Due to the ensemble structure of a Random Decision Forest, the classifier can calculate a confidence score for its predictions. The confidence score is calculated based on the proportion of decision trees in the forest that agreed with the winning classification for the given sample.
This confidence could be presented to a user in several ways. A user could set a confidence threshold below which predictions should be ignored; the system could prompt the user for additional labeled samples whenever the confidence is too low; or the confidence could be reflected in the visual presentation of the prediction (size, color, etc.) so that the user can take it into consideration.
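To make the vote-counting concrete, here is a small sketch that assumes each tree's vote is available as a label in a list. The `accept_or_ask` helper and its 0.7 default threshold are hypothetical choices for illustration, not values taken from any implementation.

```python
from collections import Counter

def predict_with_confidence(votes):
    """Return the winning label and the share of trees that voted for it."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

def accept_or_ask(votes, threshold=0.7):
    """Return (label, confidence), or (None, confidence) when the forest is
    too unsure and the interface should ask the user for more labeled samples."""
    label, confidence = predict_with_confidence(votes)
    if confidence < threshold:
        return None, confidence
    return label, confidence

# Hypothetical votes from a ten-tree forest.
confident = accept_or_ask(["cat"] * 8 + ["dog"] * 2)
unsure = accept_or_ask(["cat"] * 5 + ["dog"] * 5)
```

The same confidence value could just as easily drive the size or color of the displayed prediction instead of a hard cutoff.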
OpenCV’s implementation: CvRTrees::predict_prob() (Note: OpenCV’s implementation only works on binary classification problems.)
Output: individual decision trees
Since a Random Decision Forest is usually implemented on top of a simpler decision tree classifier, many implementations provide direct access to the individual decision trees that make up the ensemble.
With access to the individual decision trees, an application could provide the user with a comprehensive visualization of the Forest’s operation including showing the error rates for the individual trees and the variable on which each tree made each split. This visualization could aid in feature selection and in-depth evaluation and exploration of the quality of the training set.
OpenCV’s implementation: CvRTrees::get_tree()
Output: training error
Since Random Decision Forests store each of their training samples internally as they construct their decision trees, unlike many other machine learning methods, they can evaluate their own training error after the completion of training. On classification problems, this error is calculated as the percentage of misclassified training samples; on regression problems, it is the mean of the squared errors.
This error metric is simple enough that it could be shown to an end-user as a basic form of feedback on the current state of training quality. Shown without other metrics, however, it risks encouraging the user to overfit the training set.
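Both error measures are simple to compute once the forest's predictions for the training samples are in hand. The sketch below assumes those predictions are passed in as plain lists; the function names and sample values are illustrative only.

```python
def classification_error(predicted, actual):
    """Percentage of samples whose predicted label is wrong."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return 100.0 * wrong / len(actual)

def mean_squared_error(predicted, actual):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical predictions on four labeled samples: two are wrong.
class_error = classification_error([1, 0, 1, 1], [1, 1, 1, 0])

# Hypothetical regression outputs: squared errors are 0, 1 and 4.
regression_error = mean_squared_error([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
```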
OpenCV’s implementation: CvRTrees::get_train_error() (Note: OpenCV’s implementation only works on classification problems.)
Input: Max number of trees in the forest
The most important input a user can give a Random Decision Forest is the maximum number of trees allowed in the forest. Up to the point of diminishing returns, this is essentially a proxy for the trade-off between training time and result quality.
This could be presented to the user as a slider, allowing them to choose faster training or better results throughout the process of interactively improving a classifier.
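OpenCV’s implementation: CvRTParams (the max_num_of_trees_in_the_forest constructor argument). Note that the max_depth parameter inherited from CvDTreeParams limits the depth of each tree, not the number of trees.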