Why Boosting Works
Boosting performs well in practice, and it has a few surprising characteristics. As boosting proceeds through more rounds, it builds an increasingly complex hypothesis. Yet a plot of performance on the training and test datasets shows that it does not seem to reach a point where it overfits. It also continues to improve test set (generalization) accuracy even after the training set error has gone to zero.
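The curves described above are easy to reproduce. The sketch below is one way to do it, assuming scikit-learn and matplotlib are available: it tracks the ensemble's accuracy after every boosting round on both the training and test sets. The dataset and all parameter values are illustrative, not prescribed by the text.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset; any binary classification data would do.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default weak learner is a depth-1 decision tree (a "stump").
booster = AdaBoostClassifier(n_estimators=300, random_state=0)
booster.fit(X_train, y_train)

# staged_score yields the ensemble's accuracy after each boosting round.
train_acc = list(booster.staged_score(X_train, y_train))
test_acc = list(booster.staged_score(X_test, y_test))

plt.plot(train_acc, label="train accuracy")
plt.plot(test_acc, label="test accuracy")
plt.xlabel("boosting round")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```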
All boosting algorithms share two characteristics: they combine a number of weak learners to make a prediction, and those weak learners are trained iteratively on the remaining error. There are different accounts of why boosting performs so well, based on different theories: a margin-based account, which focuses on the iterative fitting, and a bias/variance account, which focuses on the fact that boosting is an ensemble method.
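To make the shared structure concrete, here is a minimal from-scratch sketch in the spirit of AdaBoost: weak learners are fit one at a time to a reweighted version of the data, so each round concentrates on the examples the ensemble still gets wrong, and the final prediction is a weighted vote. The function names are made up for illustration, and labels are assumed to be encoded as +1/-1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Sketch of AdaBoost; y is assumed to take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                         # start with uniform example weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)            # weak learner on the weighted data
        pred = stump.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)   # weighted training error
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))  # this learner's vote weight
        w *= np.exp(-alpha * y * pred)              # upweight the misclassified points
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # Weighted vote: sign of the alpha-weighted sum of weak predictions.
    scores = sum(a * h.predict(X) for a, h in zip(learners, alphas))
    return np.sign(scores)
```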
Other ensemble methods, notably bagging and random forests, also combine multiple classifiers. In bagging, the weights for combining classifiers (w_i) are all the same (a simple majority vote), and at each iteration the dataset is created by resampling with replacement. This highlights two differences between bagging and boosting: in bagging, the classifiers are not weighted by importance, and the base learners are not fit by iteratively focusing on the "harder" data points. Because bagging often improves the performance of a classifier, this suggests that what bagging, boosting, and other ensemble methods have in common is that averaging helps: it is well known in statistics that combining multiple estimates can increase accuracy by reducing the variance, and potentially the bias, of any individual estimate.
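For contrast with the boosting sketch above, here is a minimal bagging sketch: each base learner is fit on an independent bootstrap resample rather than on reweighted "hard" points, and predictions are combined by an unweighted vote. As before, the names are illustrative and labels are assumed to be +1/-1, with X and y as NumPy arrays.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_learners=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    learners = []
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)            # resample with replacement
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        learners.append(tree)
    return learners

def bagging_predict(learners, X):
    # Unweighted majority vote over the ensemble (every w_i identical).
    votes = np.stack([h.predict(X) for h in learners])
    return np.sign(votes.mean(axis=0))              # assumes +/-1 labels
```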
The other explanation for boosting is that it helps because of the stagewise fitting procedure, which focuses attention on the misclassified examples, either through reweighting or through the gradient. There are two accounts of why this is a good thing to do: AdaBoost increases the margin, and gradient boosting performs gradient descent on the loss function. AdaBoost works by increasing the margin: the reweighting forces it to concentrate attention on the data points that remain ambiguous, which lie along the boundary between the classes, the same data points a support vector machine would focus on. Gradient boosting explains boosting from the point of view of gradient descent. Thinking of boosting as gradient descent explains the reweighting in a different way: as we progress through rounds of boosting, the gradient of the loss function points in a direction we can travel to further minimize the loss. If only a few data points remain misclassified, the gradient will point in a direction that reduces the loss on those points (since the others are already well fit).
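The gradient-descent view is easiest to see with squared-error loss, where the negative gradient at each data point is simply the residual y - F(x). The sketch below, under that assumption and with illustrative names and parameters, fits a small tree to the residuals at each round and takes a step of size learning_rate in that direction; the points the current ensemble already fits well contribute little to the next round.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.1):
    """Sketch of gradient boosting with squared-error loss."""
    init = y.mean()
    F = np.full(len(y), init)                       # initial constant prediction
    trees = []
    for _ in range(n_rounds):
        residuals = y - F                           # negative gradient of squared loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        F += learning_rate * tree.predict(X)        # descend along the fitted gradient
        trees.append(tree)
    return init, trees, learning_rate

def gradient_boost_predict(init, trees, learning_rate, X):
    return init + learning_rate * sum(t.predict(X) for t in trees)
```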