Random Forest Classifier Tutorial: How to Use Tree-Based Algorithms for Machine Learning (2023)

Tree-based algorithms are popular machine learning methods used to solve supervised learning problems. These algorithms are flexible and can handle both classification and regression tasks.

To make a prediction, tree-based methods place a sample in a region of the feature space and predict the mean of the training samples in that region for continuous targets, or the mode for categorical targets. They also tend to produce predictions with high accuracy, stability, and ease of interpretation.

There are different tree-based algorithms that you can use, such as the following (each is sketched in code below):

  • Decision Trees
  • Random Forest
  • Gradient Boosting
  • Bagging (Bootstrap Aggregation)
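
For orientation, here is a minimal sketch (assuming scikit-learn is installed) of the estimator classes that correspond to the four algorithms above:

# Sketch: the scikit-learn estimators for the four algorithms listed above.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

models = {
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=100),
    "gradient_boosting": GradientBoostingClassifier(),
    # "estimator" is named "base_estimator" in scikit-learn versions before 1.2
    "bagging": BaggingClassifier(estimator=DecisionTreeClassifier()),
}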

So every data scientist should learn these algorithms and use them in their machine learning projects.

In this article, you will learn more about the random forest algorithm. After completing this article, you should be able to use the random forest algorithm to build predictive models for classification problems with scikit-learn.

Random forest is one of the most popular tree-based supervised learning algorithms. It is also the most flexible and easy to use.

The algorithm can be used to solve both classification and regression problems. A random forest combines hundreds of decision trees and trains each tree on a different sample of the observations.

The final prediction of the random forest is obtained by averaging the predictions of the individual trees (for regression) or by majority vote (for classification).

The benefits of random forests are numerous. Individual decision trees tend to overfit the training data, but a random forest mitigates that issue by averaging the prediction results of many different trees. This gives random forests higher predictive accuracy than a single decision tree.

The random forest algorithm can also help you find the features that are important in your dataset. It forms the basis of the Boruta algorithm, which selects the important features in a dataset.

Random forest has been used in a variety of applications, for example to provide recommendations of different products to customers in e-commerce.

In medicine, a random forest algorithm can be used to identify a patient's disease by analyzing their medical records.

In the banking sector, it can be used to determine whether a customer is fraudulent or legitimate.

The random forest algorithm works by completing the following steps:

Step 1: The algorithm selects random samples from the dataset provided.

Step 2: The algorithm will create a decision tree for each sample selected. Then it will get a prediction result from each decision tree created.

Step 3: Voting will then be performed for every predicted result. For a classification problem, it will use mode, and for a regression problem, it will use mean.

Step 4: Finally, the algorithm selects the most voted prediction result as the final prediction.
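
To make the four steps concrete, here is an illustrative from-scratch sketch (not scikit-learn's actual implementation) that bootstraps samples, fits one tree per sample, and takes a majority vote. It assumes numpy arrays X and y with non-negative integer class labels, and new samples X_new to predict:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simple_random_forest(X, y, X_new, n_trees=100, seed=42):
    """Bootstrap, fit one tree per sample, and take a majority vote."""
    rng = np.random.default_rng(seed)
    all_preds = []
    for _ in range(n_trees):
        # Step 1: draw a bootstrap sample (random rows, with replacement)
        idx = rng.integers(0, len(X), size=len(X))
        # Step 2: build a decision tree on that sample and collect its predictions
        tree = DecisionTreeClassifier(max_features="sqrt")
        tree.fit(X[idx], y[idx])
        all_preds.append(tree.predict(X_new))
    # Steps 3 and 4: take the mode (majority vote) across trees for each sample;
    # a regression version would average instead
    votes = np.stack(all_preds)
    return np.array([np.bincount(col).argmax() for col in votes.T])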

[Figure: how the random forest algorithm works]


Now that you know the ins and outs of the random forest algorithm, let's build a random forest classifier.

We will build a random forest classifier using the Pima Indians Diabetes dataset, which involves predicting the onset of diabetes within 5 years based on provided medical details. This is a binary classification problem.

Our task is to analyze the data and create a model that predicts whether a particular patient is at risk of developing diabetes, given the other independent factors.

We will start by importing important packages that we will use to load the dataset and create a random forest classifier. We will use the scikit-learn library to load and use the random forest algorithm.

# import important packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler

import pandas_profiling
from matplotlib import rcParams
import warnings

warnings.filterwarnings("ignore")

# figure size in inches
rcParams["figure.figsize"] = 10, 6

np.random.seed(42)

Dataset

Then load the dataset from the data directory:

# Load dataset
data = pd.read_csv("../data/pima_indians_diabetes.csv")

Now we can look at a sample of the dataset.

# show sample of the dataset
data.sample(5)
[Output: five random rows from the dataset]

As you can see, in our dataset we have different features with numerical values.

Let's understand the list of features we have in this dataset.

# show columns
data.columns
[Output: the dataset's column names]

In this dataset, there are 8 input features and 1 output/target feature. Missing values are believed to be encoded as zero values. The meanings of the variable names are as follows (from the first to the last feature):

  • Number of times pregnant.
  • Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
  • Diastolic blood pressure (mm Hg).
  • Triceps skinfold thickness (mm).
  • 2-hour serum insulin (mu U/ml).
  • Body mass index (weight in kg/(height in m)^2).
  • Diabetes pedigree function.
  • Age (years).
  • Class variable (0 or 1).

Then we split the dataset into independent features and target feature. Our target feature for this dataset is called class.

# split data into input and target variable(s)
X = data.drop("class", axis=1)
y = data["class"]

Preprocessing the Dataset

Before we create a model, we need to standardize our independent features using the StandardScaler class from scikit-learn.

# standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

You can learn more about how and why to standardize your data from this article.

Splitting the dataset into Training and Test data

We now split our processed dataset into training and test data. The test data will be 10% of the entire processed dataset.

# split into train and test set
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, stratify=y, test_size=0.10, random_state=42
)
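
One optional refinement (a sketch of an alternative to the cell above, not what this tutorial does): fit the scaler on the training split only, so that no test-set statistics leak into the model. It reuses the imports from the earlier import cell.

# split first, then fit the scaler on the training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.10, random_state=42
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/std from the train set only
X_test = scaler.transform(X_test)        # reuse those statistics on the test set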

Building the Random Forest Classifier

Now it is time to create our random forest classifier and then train it on the training set. We will set the number of trees in the forest to 100 through the parameter called n_estimators.

# create the classifier
classifier = RandomForestClassifier(n_estimators=100)

# Train the model using the training sets
classifier.fit(X_train, y_train)
[Output: the fitted RandomForestClassifier with its parameter values]

The above output shows different parameter values of the random forest classifier used during the training process on the train data.

After training, we can make predictions on the test data.

# prediction on the test set
y_pred = classifier.predict(X_test)

Then we check the accuracy using actual and predicted values from the test data.

# Calculate Model Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.8051948051948052

Our accuracy is around 80.5% which is good. But we can always make it better.

Identify Important Features

As mentioned before, we can also check the important features by using the feature_importances_ attribute of the random forest classifier in scikit-learn.

# check Important features
feature_importances_df = pd.DataFrame(
    {"feature": list(X.columns), "importance": classifier.feature_importances_}
).sort_values("importance", ascending=False)

# Display
feature_importances_df
[Output: table of features sorted by importance score]

The table above shows the relative importance of the features and their contribution to the model. We can also visualize these features and their scores using the seaborn and matplotlib libraries.

# visualize important features
# Creating a bar plot
sns.barplot(x=feature_importances_df.feature, y=feature_importances_df.importance)

# Add labels to the plot
plt.xlabel("Features")
plt.ylabel("Feature Importance Score")
plt.title("Visualizing Important Features")
plt.xticks(
    rotation=45, horizontalalignment="right", fontweight="light", fontsize="x-large"
)
plt.show()
[Figure: bar plot of feature importance scores]

From the figure above, you can see the triceps_skinfold_thickness feature has low importance and does not contribute much to the prediction.

This means that we can remove this feature and train our random forest classifier again and then see if it can improve its performance on the test data.

# load data with selected features
X = data.drop(["class", "triceps_skinfold_thickness"], axis=1)
y = data["class"]

# standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# split into train and test set
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, stratify=y, test_size=0.10, random_state=42
)

We will train the random forest algorithm with the selected processed features from our dataset, perform predictions, and then find the accuracy of the model.

# Create a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)

# Train the model using the training sets
clf.fit(X_train, y_train)

# prediction on the test set
y_pred = clf.predict(X_test)

# Calculate Model Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.8181818181818182

Now the model accuracy has increased from 80.5% to 81.8% after we removed the least important feature called triceps_skinfold_thickness.

This shows how important it is to check feature importances and see whether removing the least important features can improve your model's performance.

Tree-based algorithms are really important for every data scientist to learn. In this article, you've learned the basics of tree-based algorithms and how to create a classification model by using the random forest algorithm.

I also recommend you try other types of tree-based algorithms such as the Extra-trees algorithm.

You can download the dataset and notebook used in this article here: https://github.com/Davisy/Random-Forest-classification-Tutorial

Congratulations, you have made it to the end of this article!

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post! I can also be reached on Twitter @Davis_McDavid

FAQs

What's the best number of trees in a random forest algorithm?

Research suggests that a random forest should have between 64 and 128 trees. With that, you should have a good balance between ROC AUC and processing time.
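
As a hedged sketch, you can check this range on your own data with cross-validation (reusing X_scaled and y from the tutorial above):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# compare candidate tree counts by cross-validated ROC AUC
for n in (32, 64, 128, 256):
    clf = RandomForestClassifier(n_estimators=n, random_state=42)
    auc = cross_val_score(clf, X_scaled, y, cv=5, scoring="roc_auc").mean()
    print(f"{n:>4} trees: mean ROC AUC = {auc:.3f}")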

How does the random forest algorithm work in machine learning?

Step 1: Select random samples from the given training set. Step 2: Construct a decision tree for each sample. Step 3: Vote over the trees' predictions, taking the majority vote for classification or the average for regression. Step 4: Finally, select the most voted prediction result as the final prediction.

Is random forest a tree-based algorithm?

Random forest is one of the most popular tree-based supervised learning algorithms. It is also the most flexible and easy to use. The algorithm can be used to solve both classification and regression problems.

What type of machine learning algorithm is random forest?

Random forest is a supervised machine learning algorithm that is widely used in classification and regression problems.

Which algorithm is better: decision tree or random forest?

Random forest is a strong modeling technique and much more robust than a single decision tree. Many decision trees are aggregated to limit overfitting, as well as errors due to bias, and to achieve the final result.

What is the best use for a tree algorithm?

Tree-based algorithms are considered to be among the best and most widely used supervised learning methods. They empower predictive models with high accuracy, stability, and ease of interpretation. Unlike linear models, they map non-linear relationships quite well.

What is the difference between a decision tree and a random forest?

A decision tree combines a series of decisions, whereas a random forest combines many decision trees. Training a random forest is therefore a longer, slower process that needs rigorous training, whereas a single decision tree is fast and operates easily on large datasets, especially linear ones.

What is the decision tree algorithm in machine learning?

A decision tree is a non-parametric supervised learning algorithm which is utilized for both classification and regression tasks. It has a hierarchical tree structure, which consists of a root node, branches, internal nodes, and leaf nodes.

How do you choose the number of trees in a random forest?

To tune the number of trees in a random forest, train the model with a large number of trees (for example, 1,000) and select the optimal subset of trees from it. There is no need to train a new random forest with a different number of trees each time.
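
A minimal sketch of this idea uses scikit-learn's warm_start flag, which grows the same forest incrementally instead of retraining from scratch (assuming the train/test split from the tutorial above):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(warm_start=True, random_state=42)
for n in (100, 250, 500, 1000):
    clf.set_params(n_estimators=n)  # only the additional trees are fitted
    clf.fit(X_train, y_train)
    print(n, "trees -> test accuracy:", clf.score(X_test, y_test))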

Is XGBoost a decision tree or random forest?

Random Forest and XGBoost are two popular decision-tree-based algorithms for machine learning. XGBoost is neither a single decision tree nor a random forest: it is a gradient-boosted ensemble of decision trees, built sequentially rather than independently.

How does random forest generate trees?

A random forest is a classification algorithm consisting of many decision trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.

What is the benefit of using a random forest model over a single decision tree?

The biggest advantage of random forest is that it relies on collecting various decision trees to arrive at a solution. It is an ensemble algorithm that combines the results of more than one model of the same or a different kind.

Why do we prefer a forest (a collection of trees) rather than a single tree?

The majority prediction from multiple trees is better than an individual tree's prediction because the trees protect each other from their individual errors. This is, however, dependent on the trees being relatively uncorrelated with each other.

Is the random forest algorithm supervised or unsupervised?

Random forest is a supervised learning algorithm. A random forest is an ensemble of decision trees combined with a technique called bagging. In bagging, decision trees are used as parallel estimators.
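
To make the bagging connection concrete, here is a rough sketch: bagged decision trees with per-split feature randomness behave much like a random forest (max_features="sqrt" is an illustrative choice mirroring the forest's default for classification):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

almost_a_forest = BaggingClassifier(
    # "estimator" is named "base_estimator" in scikit-learn versions before 1.2
    estimator=DecisionTreeClassifier(max_features="sqrt"),
    n_estimators=100,
    bootstrap=True,  # each parallel estimator sees a bootstrap sample
    random_state=42,
)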

What are the limitations of a random forest classifier?

The main limitation of random forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions. In general, these algorithms are fast to train, but quite slow to create predictions once they are trained.

Why should we prefer a random forest classifier over a decision tree? Explain with an example.

The random forest algorithm avoids and prevents overfitting by using multiple trees, which gives accurate and precise results. Decision trees require less computation, reducing the time needed to implement them, but they carry lower accuracy.

Are more trees better in a random forest?

More trees usually means higher accuracy at the cost of slower learning. If you wish to speed up your random forest, lower the number of estimators. If you want to increase the accuracy of your model, increase the number of trees.

Are tree-based algorithms better than linear models?

When there are a large number of features with few samples (and low noise), linear regression may outperform decision trees and random forests. In general, decision trees will have better average accuracy. For categorical independent variables, decision trees are better than linear regression.

What role does the number of trees play in a random forest algorithm?

The number of trees parameter in a random forest model determines the number of simple models, or the number of decision trees, that are combined to create the final prediction. If the number of trees is set to 100, then there will be 100 simple models that are trained on the data.

What causes overfitting in random forest?

A random forest model can overfit when the number of trees is very low (fewer than about 100), but model performance quickly rises and the overfitting is rectified as more trees are added (roughly 100 to 400).

Why is logistic regression better than random forest?

In general, logistic regression performs better when the number of noise variables is less than or equal to the number of explanatory variables and random forest has a higher true and false positive rate as the number of explanatory variables increases in a dataset.

What are decision trees and random forests in machine learning?

A random forest is simply a collection of decision trees whose results are aggregated into one final result. Their ability to limit overfitting without substantially increasing error due to bias is why they are such powerful models. One way Random Forests reduce variance is by training on different samples of the data.

What is the difference between a classification tree and a regression tree?

The primary difference is that classification decision trees are built for unordered, categorical dependent variables, whereas regression decision trees take ordered, continuous values.

How do you train a decision tree classifier in Python?

Decision Tree Classifier Building in Scikit-learn
  1. Importing Required Libraries. Let's first load the required libraries. ...
  2. Loading Data. ...
  3. Feature Selection. ...
  4. Splitting Data. ...
  5. Building Decision Tree Model. ...
  6. Evaluating the Model. ...
  7. Visualizing Decision Trees.
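
A condensed sketch of those steps, reusing the Pima train/test split from the tutorial above (max_depth=3 is an illustrative choice):

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier(max_depth=3, random_state=42)  # build the model
tree.fit(X_train, y_train)                                   # train it
tree_pred = tree.predict(X_test)                             # predict on the test set
print("Decision tree accuracy:", accuracy_score(y_test, tree_pred))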

Does random forest require tree pruning?

Unlike a single tree, no pruning takes place in a random forest; i.e., each tree is grown fully. In decision trees, pruning is a method to avoid overfitting. Pruning means selecting a subtree that leads to the lowest test error rate.
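
For a single tree, scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter. A hedged sketch, reusing the training split from above:

from sklearn.tree import DecisionTreeClassifier

# compute the pruning path: candidate alphas and the impurity of each subtree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
# a larger ccp_alpha prunes more aggressively (yields a smaller subtree);
# picking the middle alpha here is purely illustrative
pruned = DecisionTreeClassifier(
    ccp_alpha=path.ccp_alphas[len(path.ccp_alphas) // 2], random_state=42
)
pruned.fit(X_train, y_train)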

How deep should the trees in a random forest be?

Generally, we go with a max depth of 3, 5, or 7. Another key parameter is max_features, the number of columns considered for splitting; the specific features passed to each decision tree can vary between trees.
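
A short sketch of both knobs on the classifier itself (note that in scikit-learn, max_features limits the features considered at each split):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,          # cap the depth of every tree (e.g. 3, 5, or 7)
    max_features="sqrt",  # features considered at each split
    random_state=42,
)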

Are trees trained sequentially in a random forest?

A random forest is a collection of multiple decision trees which are trained independently of one another, so there is no notion of sequentially dependent training (which is the case in boosting algorithms). As a result, it is possible to train the trees in parallel.

In which cases is random forest better than XGBoost?

If the field of study is bioinformatics or multiclass object detection, random forest is the best choice, as it is easy to tune and works well even if there is a lot of missing data and noise, and overfitting will not happen easily. XGBoost can give accurate results but is hard to work with if there is a lot of noise.

What is the difference between random forest and gradient-boosted trees?

The main difference between random forests and gradient boosting lies in how the decision trees are created and aggregated. Unlike random forests, the decision trees in gradient boosting are built additively; in other words, each decision tree is built one after another.

How does random forest overcome the overfitting of a decision tree?

Random forests deal with the problem of overfitting by creating multiple trees, with each tree trained slightly differently, so it overfits differently. A random forest is a classifier that combines a large number of decision trees; the decisions of each tree are then combined to make the final classification.

How do you find the most important features in a random forest model?

Permutation-based feature importance (with scikit-learn)

This method will randomly shuffle each feature and compute the change in the model's performance. The features which impact the performance the most are the most important ones. Permutation-based importance is computationally expensive.
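
A hedged sketch with scikit-learn's permutation_importance, assuming the fitted classifier and the test split from the tutorial above:

from sklearn.inspection import permutation_importance

result = permutation_importance(
    classifier, X_test, y_test, n_repeats=10, random_state=42
)
# larger mean drops in score indicate more important features
for name, score in zip(X.columns, result.importances_mean):
    print(f"{name}: {score:.4f}")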

Why is random forest better than deep learning?

Random Forest is less computationally expensive and does not require a GPU to finish training. A random forest can give you a different interpretation of a decision tree but with better performance. Neural Networks will require much more data than an everyday person might have on hand to actually be effective.

What are the drawbacks of random forest?

Disadvantages of random forests

Prediction accuracy on complex problems is usually inferior to that of gradient-boosted trees. A forest is less interpretable than a single decision tree, which can be visualized as a sequence of decisions. A trained forest may also require significant memory for storage, due to the need to retain the information from several hundred individual trees.

What is the advantage of a random forest?

Among available classification methods, random forests often provide the highest accuracy. The random forest technique can also handle big data with numerous variables running into the thousands. It can automatically balance datasets when a class is more infrequent than other classes in the data.

Is random forest a tree algorithm?

Random forest is a commonly-used machine learning algorithm trademarked by Leo Breiman and Adele Cutler, which combines the output of multiple decision trees to reach a single result. Its ease of use and flexibility have fueled its adoption, as it handles both classification and regression problems.

Is random forest a tree-based model?

A. Random Forest is a supervised learning algorithm that works on the concept of bagging. In bagging, a group of models is trained on different subsets of the dataset, and the final output is generated by collating the outputs of all the different models. In the case of random forest, the base model is a decision tree.

How do you avoid overfitting in a random forest classifier?

To avoid over-fitting in random forest, the main thing you need to do is optimize a tuning parameter that governs the number of features that are randomly chosen to grow each tree from the bootstrapped data.
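
A minimal sketch of that tuning with a small grid search over max_features (the grid values are illustrative assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=42),
    param_grid={"max_features": ["sqrt", "log2", 0.5]},  # 0.5 = half the features
    cv=5,
)
grid.fit(X_train, y_train)
print("Best max_features:", grid.best_params_)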

Is random forest good for large datasets?

Random forests are great with high-dimensional data, since we work with subsets of the data. Each tree is fast to train because it considers only a subset of the features, so we can easily work with hundreds of features.

Does increasing the number of trees in a random forest cause overfitting?

Random forests do not overfit in this way. The testing performance of a random forest does not decrease (due to overfitting) as the number of trees increases; after a certain number of trees, performance tends to plateau.

How many features is too many for a random forest?

More data is better for neural networks, as those networks select the best possible features out of the data on their own. Also, 175 features is too many, and you should look into dimensionality reduction techniques and select the features which are highly correlated with the target.

How do I get rid of overfitting in random forest?

How to prevent overfitting in random forests
  1. Reduce tree depth. If you do believe that your random forest model is overfitting, the first thing you should do is reduce the depth of the trees in your random forest model. ...
  2. Reduce the number of variables sampled at each split. ...
  3. Use more data.

Why am I getting 100% accuracy for random forest?

Having 100% train and test accuracy probably means that your model is massively overfitting because of the amount of data. In general, you should avoid overfitting as well as underfitting, because both damage the performance of machine learning algorithms.
