- Take the entire data set as input.
- Search for a split that maximizes the "separation" of the classes. A split is any test that divides the data in two (e.g. variable2 > 10).
- Apply the split to the input data (the divide step).
- Recursively repeat the search and the split on each of the two resulting subsets.
- Stop when a stopping criterion is met.
- (Optional) Prune the tree back if the splitting went too far.
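The steps above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions, not how any particular library does it: the function names (`gini`, `best_split`, `build_tree`, `predict`), the dict-based tree layout, the use of Gini impurity as the purity measure, and the `max_depth` / `min_size` stopping parameters are all made up for the example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Greedy exhaustive search over every (feature, threshold) pair.
    Returns (weighted impurity, feature index, threshold), or None if
    no test separates the rows into two non-empty groups."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3, min_size=2):
    """Recursive divide step with simple (assumed) stopping criteria."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stop on depth limit, small node, or an already pure node.
    if depth >= max_depth or len(rows) < min_size or gini(labels) == 0.0:
        return {"leaf": majority}
    split = best_split(rows, labels)
    if split is None:
        return {"leaf": majority}
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left],
                           depth + 1, max_depth, min_size),
        "right": build_tree([r for r, _ in right], [y for _, y in right],
                            depth + 1, max_depth, min_size),
    }

def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's class."""
    while "leaf" not in tree:
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]

# Toy data: the class is "b" exactly when the second variable exceeds 10.
rows = [[1, 5], [2, 12], [3, 11], [4, 3]]
labels = ["a", "b", "b", "a"]
tree = build_tree(rows, labels)
```

On this toy data the search settles on the second variable, since splitting on it leaves both subsets pure. Pruning is left out here to keep the sketch short.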
Finding a split: methods vary, from greedy exhaustive search (e.g. C4.5) to randomly selecting the candidate attributes (random forests) or even the split points themselves (extremely randomized trees).
Purity measures: information gain, Gini impurity, chi-squared statistic.
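To make two of these measures concrete, here is a small sketch of entropy-based information gain and the Gini measure (Gini impurity). The function names are my own and are not taken from any library; labels are assumed to be plain Python lists.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini_impurity(labels):
    """Probability of mislabeling a random item if it were labeled
    according to the class distribution in `labels`."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the
    child subsets produced by a split."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

parent = ["a", "a", "b", "b"]
perfect = [["a", "a"], ["b", "b"]]   # split that separates the classes
useless = [["a", "b"], ["a", "b"]]   # split that changes nothing
```

A perfect split of a balanced two-class node recovers the full 1 bit of entropy, while a split that leaves the class mix unchanged has zero gain.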
Stopping criteria: methods vary, from a minimum node size to a required confidence in the prediction or a threshold on the purity measure.
Pruning: reduced error pruning, out-of-bag error pruning (for ensemble methods).
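As a rough illustration of reduced error pruning: work bottom-up and replace a subtree with a leaf whenever that does not increase the error on a held-out validation set. The tree layout below (nested dicts with a hypothetical `majority` key holding each node's training-set majority class) and the function names are assumptions made for this sketch only.

```python
def predict(node, row):
    """Walk from `node` down to a leaf and return its class."""
    while "leaf" not in node:
        side = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[side]
    return node["leaf"]

def errors(node, rows, labels):
    """Number of misclassified validation rows under `node`."""
    return sum(predict(node, r) != y for r, y in zip(rows, labels))

def prune(node, rows, labels):
    """Return a pruned copy of `node`, judged on the validation rows
    that reach it. Note: a subtree that no validation row reaches is
    collapsed too, since a leaf cannot do worse than zero errors."""
    if "leaf" in node:
        return node
    f, t = node["feature"], node["threshold"]
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    node = dict(node)  # shallow copy so the input tree is not modified
    node["left"] = prune(node["left"], [r for r, _ in left], [y for _, y in left])
    node["right"] = prune(node["right"], [r for r, _ in right], [y for _, y in right])
    leaf = {"leaf": node["majority"]}
    if errors(leaf, rows, labels) <= errors(node, rows, labels):
        return leaf
    return node

# Hand-built tree whose right branch was split one level too far.
tree = {
    "feature": 0, "threshold": 5, "majority": "a",
    "left": {"leaf": "a"},
    "right": {
        "feature": 1, "threshold": 2, "majority": "b",
        "left": {"leaf": "b"},
        "right": {"leaf": "a"},
    },
}
val_rows = [[1, 0], [2, 9], [8, 1], [9, 7]]
val_labels = ["a", "a", "b", "b"]
pruned = prune(tree, val_rows, val_labels)
```

Here the extra split on the right branch misclassifies one validation row, so it collapses to a single leaf, while the root split is kept because removing it would add errors.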