Comparison of machine learning methods in email spam detection (2022)

Unsolicited bulk emails, also known as spam, make up approximately 60% of global email traffic. Although spam detection technology has advanced since the first unsolicited bulk email was sent in 1978, spamming remains a time-consuming and expensive problem.

This report compares the performance of three machine learning techniques for spam detection: Random Forest (RF), k-Nearest Neighbours (kNN) and Support Vector Machines (SVM).

Introduction

Despite the rising popularity of instant messaging technologies in recent years, email continues to be the dominant medium for digital communication for both consumer and business use. According to industry estimates (Symantec Corporation, 2016, p. 31) [1], approximately 200 billion emails were sent each day in 2015. On average, business users sent and received around 42 emails per day. Given those facts, it is no wonder that email is still the weapon of choice for cybercriminals who want to target the broadest possible audience electronically.

According to Nucleus Research (2007) [2], spam costs US businesses an average of $712 per employee every year through diminished productivity, lost customers, consumed bandwidth and increased maintenance costs.

Estimates (Statista, 2017) [3] suggest that slightly less than 60 percent of incoming business email traffic is unsolicited bulk email (known as spam), which was the lowest level since 2003. However, even though the global share of spam is decreasing, the competition between spammers and spam-filtering techniques continues. It is fair to say that the problem is not going away, and the need for reliable anti-spam filters remains high.

The idea of automatically classifying spam and non-spam emails by applying machine learning methods has long been popular in academia and has been a topic of interest for many researchers.

Knowledge engineering and machine learning are the two main approaches researchers have applied to the spam-filtering problem. The first focuses on creating a knowledge-based system in which pre-defined rules dictate whether an incoming message is legitimate or not. The primary disadvantage of this method is that those rules need to be maintained and updated continuously by the user or by a third party such as a software vendor.

The machine learning approach, in contrast, does not require pre-defined rules, but instead a set of messages that have already been classified correctly. Those messages make up the training dataset, which is used to fit the model. One could say the algorithm infers the classification rules from the training data.

This study compares three algorithms which are suitable for classification problems. In particular, we included the following methods:

  • Random Forest
  • k-Nearest Neighbours
  • Support Vector Machines with Linear Kernel

For the experiment, we use Hewlett Packard’s Spambase dataset which is publicly available and downloadable from the UCI Machine Learning Repository.

Methods

The following part provides a brief introduction to the three methods used for the experiment and compares general advantages and disadvantages.

Random Forest

Tin Kam Ho first introduced the general method of random decision forests at AT&T Bell Labs in 1995 (Ho, 1995) [4]. The idea is that

“If one tree is good, then many trees (a forest) should be better.”

(Marsland, 2014, p. 275) [5]

The algorithm derives the classification label for a new document from a set of decision trees. For each tree, a sample is drawn from the training data, and the tree is built using a random subset of all features (hence “Random”). The algorithm is suitable for complex classification tasks in small datasets (Breiman, 2001) [6]. By averaging over many trees, random-forest-based models have a significantly lower risk of overfitting and lower variance than single decision trees. The major drawback is speed: a large number of trees may make the method slow for real-time prediction.
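To make the bootstrap-and-vote idea concrete, here is a minimal Python sketch. It is a deliberate simplification and not the implementation used in the experiment: each "tree" is only a one-split stump, the random feature subset is drawn once per tree, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Find the single (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue  # a split must put samples on both sides
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        # No usable split: fall back to always predicting the majority label.
        majority = Counter(labels).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]                # bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)       # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(l if row[f] <= t else r for f, t, l, r in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two numeric features per message.
rows = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (2.0, 1.9), (2.2, 2.1), (1.8, 2.4)]
labels = ["Non-Spam"] * 3 + ["Spam"] * 3
forest = fit_forest(rows, labels)
print(predict(forest, (2.1, 2.0)))
```

Because every prediction has to consult all trees, the cost of a prediction grows linearly with `n_trees`, which is the speed drawback mentioned above.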

k-Nearest Neighbours

The k-nearest neighbour (kNN) classifier is a straightforward method and works well for simple recognition problems. It is considered an example-based classifier because the training data is used for comparison rather than to build an explicit representation of each category. In the literature, kNN is also often referred to as a lazy learner.

When a new document needs to be categorised, kNN finds the k nearest neighbours (most similar documents) in the training dataset. Because those neighbours have already been categorised, kNN assigns the new document to the category that is most common among them. This comparison happens at prediction time, and therefore the main drawback of this approach is that the kNN algorithm must compute the distance to, and sort, all the training data for each prediction, which can be slow for a large training dataset (James, Witten, Hastie, & Tibshirani, 2013, pp. 39–42) [7].
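The procedure can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes Euclidean distance as the similarity measure, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_rows, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Distance to every training row is computed for each prediction;
    # this is the per-query cost discussed above.
    distances = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical toy data: two numeric features per message.
train_rows = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (2.0, 1.9), (2.2, 2.1), (1.8, 2.4)]
train_labels = ["Non-Spam", "Non-Spam", "Non-Spam", "Spam", "Spam", "Spam"]

print(knn_predict(train_rows, train_labels, (1.9, 2.0)))
```

Note that there is no training step at all: the "model" is the stored training data, which is why kNN is called a lazy learner.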

Support Vector Machines

The original Support Vector Machines algorithm was designed by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963 (Vapnik & Chervonenkis, 1964) [8]. SVM has its foundation in the concept of decision planes, which define the decision boundaries. It separates two classes by finding the optimal hyperplane, that is, the one with the maximum margin between them. SVM provides high accuracy on small and clean datasets but tends to perform less well on noisier datasets with overlapping classes (James et al., 2013, pp. 349–359) [7].
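The following Python sketch illustrates what a fitted linear SVM looks like at prediction time. It does not show training: the weight vector `w` and offset `b` are hypothetical values chosen for illustration, whereas a real SVM would find them by maximising the margin on the training data.

```python
import math

def decision_value(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Points on the non-negative side of the hyperplane are labelled Spam."""
    return "Spam" if decision_value(w, b, x) >= 0 else "Non-Spam"

def margin_width(w):
    """Distance between the two margin hyperplanes w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane parameters (assumed, not fitted).
w = (3.0, 4.0)
b = -5.0

print(classify(w, b, (2.0, 1.0)))
print(margin_width(w))
```

The margin width depends only on the length of `w`, so maximising the margin is equivalent to finding the shortest weight vector that still separates the classes.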

Procedure

The following part describes the experiment procedure including exploratory data analysis, model fitting, evaluation and prediction.

Exploratory Data Analysis

The Spambase dataset was compiled by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt at Hewlett-Packard Labs. The set includes a total of 4601 observations from Mr Forman’s personal email account: 2788 messages are classified as Non-Spam and 1813 as Spam (cf. figure 1).

[Figure 1: Distribution of Spam and Non-Spam messages in the Spambase dataset]

The dataset has 58 attributes, of which 57 are continuous and one is a nominal class label. Typically, documents are represented as vectors of word frequencies. The dataset includes measurements for 6 character frequencies and 48 word frequencies, for words such as “Internet”, “George” (Mr Forman’s first name) and “Credit”. Furthermore, three attributes represent the average, the maximum and the total length of uninterrupted sequences of uppercase characters (cf. figure 2).
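As an illustration of how such attributes can be derived from raw text, the Python sketch below computes one word frequency and the three uppercase-run statistics. It assumes that frequencies are expressed as percentages of all words and that matching is case-insensitive; the function names and the example text are hypothetical, and the dataset's own documentation is the authority on the exact definitions.

```python
import re

def word_freq(text, word):
    """Percentage of words in `text` that equal `word` (case-insensitive, assumed)."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * hits / len(words)

def capital_run_lengths(text):
    """Average, longest and total length of uninterrupted uppercase runs."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        return 0.0, 0, 0
    return sum(runs) / len(runs), max(runs), sum(runs)

# Hypothetical message text.
text = "FREE credit! Get your FREE credit report NOW"

print(word_freq(text, "credit"))      # 2 of 8 words
print(capital_run_lengths(text))      # runs: FREE, G, FREE, NOW
```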

[Figure 2: Attributes of the Spambase dataset]

Feature Selection

A widely used algorithm for automatic feature selection is Recursive Feature Elimination, or RFE. It is based on the idea of repeatedly constructing a model and selecting either the worst- or best-performing feature. RFE then removes that feature from the set and repeats the process with the remaining features.
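The elimination loop itself is short, as the following Python sketch shows. It is a simplified illustration, not the procedure used for figure 3: the importance measure (the gap between class means) is a hypothetical stand-in for the feature importances that a real model, such as a Random Forest, would provide after being refitted in each round.

```python
def rfe(feature_names, importance_fn, n_keep):
    """Repeatedly drop the least important feature until n_keep remain."""
    remaining = list(feature_names)
    eliminated = []
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)  # "refit" on the surviving features
        worst = min(remaining, key=lambda name: scores[name])
        remaining.remove(worst)
        eliminated.append(worst)
    return remaining, eliminated

def mean_gap_importance(rows, labels):
    """Stand-in importance: absolute gap between the Spam and Non-Spam means."""
    def importance(names):
        scores = {}
        for name in names:
            spam = [r[name] for r, y in zip(rows, labels) if y == 1]
            ham = [r[name] for r, y in zip(rows, labels) if y == 0]
            scores[name] = abs(sum(spam) / len(spam) - sum(ham) / len(ham))
        return scores
    return importance

# Hypothetical word-frequency data (1 = Spam, 0 = Non-Spam).
rows = [
    {"free": 2.0, "george": 0.0, "the": 1.0},
    {"free": 1.5, "george": 0.0, "the": 1.2},
    {"free": 0.0, "george": 1.0, "the": 1.1},
    {"free": 0.1, "george": 0.8, "the": 0.9},
]
labels = [1, 1, 0, 0]

kept, eliminated = rfe(["free", "george", "the"], mean_gap_importance(rows, labels), n_keep=1)
print(kept, eliminated)
```

The order in which features are eliminated gives a ranking, which is what allows accuracy to be plotted against the number of features retained.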

Figure 3 illustrates RFE applied with a Random Forest algorithm to the Spambase dataset. All 57 attributes were selected in this example, although the plot shows that selecting just 44 attributes provides similar accuracy.

[Figure 3: RFE with Random Forest applied to the Spambase dataset]

Classification tree

Another handy technique in data mining is recursive partitioning. This method helps to visualise the decision rules for a particular prediction. Figure 4 shows an example of a classification tree on the Spambase dataset.
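A minimal Python sketch of recursive partitioning is given below. It assumes Gini impurity as the split criterion, which is one common choice and not necessarily the one behind figure 4; the function names, the depth limit and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (score, feature, threshold) with the lowest weighted impurity."""
    best = None
    for f in range(len(rows[0])):
        # Skip the largest value so that the right-hand side is never empty.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=2):
    """Split recursively; a leaf is a label, an inner node a 4-tuple."""
    majority = Counter(labels).most_common(1)[0][0]
    split = best_split(rows, labels) if depth < max_depth and gini(labels) > 0 else None
    if split is None:
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth))

def predict(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

# Hypothetical data: (frequency of "$", frequency of "george").
rows = [(0.0, 1.2), (0.0, 0.9), (0.3, 0.0), (0.5, 0.0), (0.4, 0.0), (0.0, 0.0)]
labels = ["Non-Spam", "Non-Spam", "Spam", "Spam", "Spam", "Non-Spam"]

tree = build_tree(rows, labels)
print(tree)
```

Each inner node reads as a rule of the form "if feature f is at most t, go left, otherwise go right", which is what makes the resulting tree easy to visualise.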

[Figure 4: Example classification tree for the Spambase dataset]

Training

Having completed the exploratory data analysis, we now prepare the data and train our models using the three methods described above. The data preparation involves the following steps:

  • Set human readable column names on the data frame
  • Replace the class data with descriptive labels, where zero represents “Non-Spam” and one marks a record as “Spam”
  • Cast the class column to data type factor, as the caret package complains if labels are 0 or 1
  • Draw a random sample of 1000 observations and split it randomly into training and test sets, with 70% of the sample used for training
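The labelling, sampling and splitting steps above can be sketched as follows. The experiment itself relied on the caret package in R, so this Python version is only an analogous illustration: the function name, the seed and the synthetic records are hypothetical, and the column-naming and factor-casting steps have no direct counterpart here.

```python
import random

LABELS = {0: "Non-Spam", 1: "Spam"}

def prepare(records, sample_size=1000, train_fraction=0.7, seed=42):
    """Sample records, attach descriptive labels and split into train/test."""
    rng = random.Random(seed)
    # rng.sample returns the chosen records in random order,
    # so slicing the result yields a random split.
    sample = rng.sample(records, min(sample_size, len(records)))
    labelled = [(features, LABELS[cls]) for features, cls in sample]
    cut = int(round(train_fraction * len(labelled)))
    return labelled[:cut], labelled[cut:]

# Synthetic stand-in for the 4601 Spambase rows: (features, class) pairs.
records = [([float(i)], i % 2) for i in range(4601)]

train, test = prepare(records)
print(len(train), len(test))
```

A fixed seed keeps the split reproducible, which matters when several models are to be compared on exactly the same test set.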

Prediction and evaluation

Finally, now that all three models have been trained, let us look at how they compare in terms of performance. We evaluate the three approaches using the most commonly used indicators: spam precision (SP), spam recall (SR) and accuracy (A). All three indicators are derived from the confusion matrix of each model (cf. figure 5).

  • Spam precision is the percentage of emails classified as Spam that actually are Spam
  • Spam recall is the percentage of all Spam emails which are correctly classified as Spam
  • The accuracy is the percentage of all emails that are correctly categorised
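The three indicators follow directly from the four cells of a confusion matrix, as the short Python sketch below shows. The counts in the example are hypothetical: they were chosen to be consistent with the Random Forest row of the results table, and are not taken from figure 5.

```python
def spam_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) in percent.

    tp: Spam classified as Spam        fp: Non-Spam classified as Spam
    fn: Spam classified as Non-Spam    tn: Non-Spam classified as Non-Spam
    """
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    accuracy = 100.0 * (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy

# Hypothetical counts for a test set of 299 messages.
precision, recall, accuracy = spam_metrics(tp=101, fp=8, fn=15, tn=175)
print(round(precision, 2), round(recall, 2), round(accuracy, 2))
```

Precision and recall only look at the Spam class, which is why two models can share the same accuracy and still differ in how many legitimate emails they wrongly flag.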

[Figure 5: Confusion matrix of each model]

The table below summarises the performance of all three machine learning methods. The results show that k-Nearest Neighbours (kNN) and Support Vector Machine (SVM) perform similarly weakly in terms of accuracy, and that Random Forest (RF) outperforms both. RF and SVM have the same, relatively high spam recall, while kNN performs noticeably worse in that category. Finally, RF has the highest spam precision, with SVM roughly 8 points lower.

Algorithm               Spam Precision (SP, %)   Spam Recall (SR, %)   Accuracy (A, %)
Random Forest           92.66                    87.07                 92.31
k-Nearest Neighbours    88.07                    82.76                 88.96
SVM Linear              84.87                    87.07                 88.96

Conclusion

Judging by the results, one could say that Random Forest is the way to go, although we need to keep in mind that none of the models has been fine-tuned at all. What the results do show is that, owing to its design, Random Forest performs relatively well “out of the box” compared to k-Nearest Neighbours and Support Vector Machine.

References

  1. Symantec Corporation. (2016). Internet Security Threat Report (Vol. 21).

  2. Nucleus Research. (2007). Spam costing US Businesses $712 Per Employee.

  3. Statista. (2017). Global spam email traffic share 2014-2017.

  4. Ho, T. K. (1995). Random decision forests. Proceedings of the 3rd International Conference on Document Analysis and Recognition, 1, 278–282.

  5. Marsland, S. (2014). Machine Learning: An Algorithmic Perspective (2nd ed.). Chapman and Hall/CRC.

  6. Breiman, L. (2001). Random Forests. Machine Learning, 45 (1), 5–32.

  7. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning.

  8. Vapnik, V., & Chervonenkis, A. (1964). A note on one class of perceptrons. Automation and Remote Control, 25.


Article information

Author: Fr. Dewey Fisher

Last Updated: 11/01/2022
