Random Forest

What Is Random Forest?

Random forest is a popular ensemble learning method used for both classification and regression because of its simplicity, flexibility, and strong predictive performance. The random forest training algorithm uses bootstrap aggregation, or bagging, to reduce the variance of the model.

Random forest models can be trained efficiently on large datasets. Each tree's training set is a bootstrap sample: observations drawn from the original data with replacement, so that each sample contains about two-thirds of the unique observations. The remaining one-third, the out-of-bag observations, are left out and used to maintain a running unbiased estimate of the classification error and of variable importance. When splitting a node, random forest considers only a random subset of the features rather than searching over all of them for the best split. This leads to more diversity and less correlation among trees, which in turn leads to better model performance.
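The mechanics above can be sketched with scikit-learn's `RandomForestClassifier` (one possible implementation; the description itself is library-agnostic). Each tree is fit on a bootstrap sample, `max_features` limits the features considered at each split, and `oob_score=True` uses the held-out out-of-bag observations to estimate accuracy. The synthetic dataset is for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,     # number of bootstrapped trees
    max_features="sqrt",  # random subset of features at each split
    bootstrap=True,       # sample training rows with replacement
    oob_score=True,       # estimate accuracy on out-of-bag rows
    random_state=0,
)
forest.fit(X, y)

print(f"OOB accuracy estimate: {forest.oob_score_:.3f}")
```

Because the out-of-bag estimate comes for free during training, no separate validation split is needed to get a first read on generalization error.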

The random forest model performance depends on two factors:

  1. The correlation between any two trees in the ensemble. Increasing the correlation decreases model performance.
  2. The performance of each individual tree in the ensemble. Improving the performance of individual trees boosts model performance. Moreover, when more trees are added to the ensemble, the model tends to overfit less.
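Factor 1 can be illustrated by varying `max_features` in scikit-learn (an assumed setup, not part of the original text): letting every split see all features produces more correlated trees, while a random subset per split decorrelates them. The exact scores depend on the synthetic data, so no particular ordering is guaranteed on every dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only.
X, y = make_classification(n_samples=500, n_features=25, n_informative=5,
                           random_state=0)

correlated = RandomForestClassifier(
    n_estimators=100, max_features=None,  # every split considers all features
    oob_score=True, random_state=0).fit(X, y)

decorrelated = RandomForestClassifier(
    n_estimators=100, max_features="sqrt",  # random feature subset per split
    oob_score=True, random_state=0).fit(X, y)

print(f"all features per split:  OOB accuracy {correlated.oob_score_:.3f}")
print(f"sqrt features per split: OOB accuracy {decorrelated.oob_score_:.3f}")
```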


Why Is Random Forest Important?

Decision trees suffer from high variance. In general, averaging a set of observations reduces variance. A natural way to reduce the variance and increase the prediction performance of a statistical learning method is to take many training sets from the population, build a separate prediction model using each training set, and average the resulting predictions. This fundamental learning concept is the key idea behind random forest.
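The variance-reduction idea above can be checked directly with a small NumPy simulation (a hypothetical setup, not drawn from the article): treat each "model" as a noisy estimate of the same true value and compare the variance of a single estimate against the variance of the average of many.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1,000 trials; in each trial, 50 independent "models" predict the
# true value 5.0 plus Gaussian noise.
predictions = 5.0 + rng.normal(scale=2.0, size=(1000, 50))

single = predictions[:, 0]          # one model's prediction per trial
averaged = predictions.mean(axis=1)  # ensemble average per trial

print(f"variance of a single model: {single.var():.3f}")
print(f"variance of the average:    {averaged.var():.3f}")
```

For independent estimates, averaging n of them divides the variance by roughly n, which is exactly the effect bagging exploits, although bootstrapped trees are only partially independent.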

Random forest is a powerful and popular method that is easy to tune: its intuitive hyperparameters usually produce good results, even without extensive hyperparameter tuning. It can be used for both regression and classification tasks, and it is easy to view the relative importance it assigns to the input features. These are all favorable properties for a machine learning model. One of the biggest problems in machine learning is overfitting, but random forest models tend to generalize well, particularly as the number of trees in the ensemble increases.
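A brief sketch of inspecting feature importances, again using scikit-learn as an assumed implementation: after fitting, `feature_importances_` exposes each input feature's relative contribution, normalized to sum to one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only 3 of the 6 features carry signal.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=1)

clf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Relative importance of each input feature (sums to 1.0).
for i, imp in enumerate(clf.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```

Importance scores like these are a common first step for feature selection or for explaining a trained model to stakeholders.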


How C3.ai Helps Organizations Apply Random Forest

Random forest models are supported on MLPipelines, a C3 AI® Suite artifact that dramatically simplifies the training, deployment, and maintenance of ML models at scale. Other C3 AI Suite services such as hyperparameter optimization are supported on top of MLPipelines, simplifying the tuning of such models.