Text Classification with Python and Scikit-Learn (2023)


Text classification is one of the most important tasks in Natural Language Processing. It is the process of classifying text strings or documents into different categories, depending upon the contents of the strings. Text classification has a variety of applications, such as detecting user sentiment from a tweet, classifying an email as spam or ham, classifying blog posts into different categories, automatic tagging of customer queries, and so on.

In this article, we will see a real-world example of text classification. We will train a machine learning model capable of predicting whether a given movie review is positive or negative. This is a classic example of sentiment analysis, where people's sentiments towards a particular entity are classified into different categories.


The dataset that we are going to use for this article can be downloaded from the Cornell Natural Language Processing Group. The dataset consists of a total of 2000 documents. Half of the documents contain positive reviews regarding a movie while the remaining half contains negative reviews. Further details regarding the dataset can be found at this link.

Unzip or extract the dataset once you download it. Open the folder "txt_sentoken". The folder contains two subfolders: "neg" and "pos". If you open these folders, you can see the text documents containing movie reviews.

Sentiment Analysis with Scikit-Learn

Now that we have downloaded the data, it is time to see some action. In this section, we will perform a series of steps required to predict sentiments from reviews of different movies. These steps can be used for any text classification task. We will use Python's Scikit-Learn library for machine learning to train a text classification model.

Following are the steps required to create a text classification model in Python:

  1. Importing Libraries
  2. Importing the Dataset
  3. Text Preprocessing
  4. Converting Text to Numbers
  5. Training and Testing Sets
  6. Training Text Classification Model and Predicting Sentiment
  7. Evaluating the Model
  8. Saving and Loading the Model

Importing Libraries

Execute the following script to import the required libraries:

import re
import pickle
import numpy as np
import nltk
from nltk.corpus import stopwords
from sklearn.datasets import load_files
nltk.download('stopwords')
nltk.download('wordnet')  # corpus needed by the WordNetLemmatizer used during preprocessing

Importing the Dataset

We will use the load_files function from the sklearn.datasets module to import the dataset into our application. The load_files function automatically divides the dataset into data and target sets. In our case, we will pass it the path to the "txt_sentoken" directory. The load_files function treats each folder inside the "txt_sentoken" folder as one category, and every document inside a folder is assigned that folder's category.

Execute the following script to see load_files function in action:

movie_data = load_files(r"D:\txt_sentoken")
X, y = movie_data.data, movie_data.target

In the script above, the load_files function loads the data from both the "neg" and "pos" folders into the X variable, while the target categories are stored in y. Here X is a list of 2000 elements, each holding a single user review. Because no encoding is passed to load_files, the reviews are loaded as bytes objects rather than decoded strings. Similarly, y is a numpy array of size 2000. If you print y on the screen, you will see an array of 1s and 0s. This is because the load_files function assigns an integer to each category, in alphabetical order of the folder names. We have two categories, so "neg" reviews are labeled 0 and "pos" reviews are labeled 1.
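To make the folder-to-label mapping concrete, here is a minimal sketch that mimics the idea with the standard library only. It is not scikit-learn's implementation: the function name load_labeled_folders and the two tiny review files are hypothetical, and the real load_files also shuffles the samples by default.

```python
import tempfile
from pathlib import Path


def load_labeled_folders(root):
    """Read every file under root/<category>/ and label it by folder index."""
    root = Path(root)
    # Sorting the folder names makes the label numbering deterministic.
    categories = sorted(p.name for p in root.iterdir() if p.is_dir())
    data, target = [], []
    for label, name in enumerate(categories):
        for path in sorted((root / name).iterdir()):
            data.append(path.read_bytes())  # raw bytes, not decoded text
            target.append(label)
    return data, target, categories


# Tiny stand-in for the "txt_sentoken" layout (hypothetical file contents).
with tempfile.TemporaryDirectory() as tmp:
    for folder, text in [("pos", "a great film"), ("neg", "a dull film")]:
        (Path(tmp) / folder).mkdir()
        (Path(tmp) / folder / "review.txt").write_text(text)
    data, target, categories = load_labeled_folders(tmp)

print(categories)  # ['neg', 'pos']
print(target)      # [0, 1]
```

The "neg" folder sorts before "pos", which is why negative reviews end up with label 0.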

Text Preprocessing

Once the dataset has been imported, the next step is to preprocess the text. Text may contain numbers, special characters, and unwanted spaces. Depending upon the problem we face, we may or may not need to remove them. For the sake of explanation, we will remove the special characters, stray single letters, and unwanted spaces from our text. Execute the following script to preprocess the data:


documents = []
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
for sen in range(0, len(X)):
    # Replace non-word characters (anything but letters, digits, underscore) with spaces
    document = re.sub(r'\W', ' ', str(X[sen]))
    # Drop the "b" left over from the bytes prefix b'...'
    document = re.sub(r'^b\s+', '', document)
    # Replace single letters surrounded by whitespace with one space
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    # Replace a single letter at the start of the document with one space
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)
    # Collapse runs of whitespace into a single space
    document = re.sub(r'\s+', ' ', document)
    # Lowercase, split into words, lemmatize each word, and join back
    document = document.lower()
    document = document.split()
    document = [lemmatizer.lemmatize(word) for word in document]
    document = ' '.join(document)
    documents.append(document)

In the script above we use regular expressions from Python's re library to perform different preprocessing tasks. We start by replacing all non-word characters with spaces. The \W pattern matches anything other than a letter, digit, or underscore, so punctuation and other special characters are removed while digits are kept.

Each review is a bytes object, so converting it with str() produces text of the form b'...', and once the quotes are replaced a stray "b" remains at the front. The regex ^b\s+ removes this "b" from the start of the string.

Next, we remove all the single characters. For instance, when we remove the punctuation mark from "David's" and replace it with a space, we get "David" and a single character "s", which has no meaning. To remove such single characters we use the \s+[a-zA-Z]\s+ regular expression, which substitutes every single letter that has whitespace on both sides with a single space.

Next, we use the ^[a-zA-Z]\s+ regular expression to replace a single character at the beginning of the document with a single space. Note that the ^ anchor must not be escaped: \^ would match a literal caret character instead of the start of the string. Replacing single characters with a single space may result in multiple spaces, which is not ideal.

We therefore use the regular expression \s+ to replace one or more whitespace characters with a single space. The next step is to convert the data to lower case so that words that are actually the same but have different cases are treated equally.
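To see what the regular expressions do to a single review, here is a small sketch using only the re module. The helper name clean and the sample text are made up for illustration, lemmatization is left out so that no NLTK data is needed, and the start-of-string rule is written with an unescaped ^ anchor so that it matches the beginning of the text.

```python
import re


def clean(raw):
    """Apply the regex cleaning steps to one review (no lemmatization)."""
    text = re.sub(r"\W", " ", str(raw))          # non-word characters -> space
    text = re.sub(r"^b\s+", "", text)            # leftover "b" from the bytes prefix
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)  # single letters inside the text
    text = re.sub(r"^[a-zA-Z]\s+", " ", text)    # single letter at the start
    text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace
    return text.lower().strip()


sample = b"David's review: a GREAT film!!"
print(clean(sample))  # david review great film
```

The possessive "s" and the article "a" disappear because both are single letters surrounded by whitespace after the punctuation has been replaced.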

The final preprocessing step is lemmatization. In lemmatization, we reduce a word to its dictionary root form. For instance, "cats" is converted into "cat". Lemmatization is done in order to avoid creating features that are semantically similar but syntactically different: we don't want two different features named "cats" and "cat" that mean the same thing. Note that the lemmatize method treats every word as a noun unless a part-of-speech tag is passed, so verb forms are mostly left unchanged here.

Converting Text to Numbers

Machines, unlike humans, cannot understand the raw text. Machines can only see numbers. Particularly, statistical techniques such as machine learning can only deal with numbers. Therefore, we need to convert our text into numbers.

Different approaches exist to convert text into the corresponding numerical form. The Bag of Words Model and the Word Embedding Model are two of the most commonly used approaches. In this article, we will use the bag of words model to convert our text to numbers.

Bag of Words

The following script uses the bag of words model to convert text documents into corresponding numerical features:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(documents).toarray()

The script above uses the CountVectorizer class from the sklearn.feature_extraction.text module. We pass a few important parameters to the constructor of the class. The first is the max_features parameter, which is set to 1500. When you convert words to numbers using the bag of words approach, every unique word in the documents becomes a feature, and a collection of documents can contain tens of thousands of unique words. Words that have a very low frequency of occurrence are usually not good features for classifying documents. Therefore we set the max_features parameter to 1500, which means that we want to use the 1500 most frequently occurring words as features for training our classifier.

The next parameter is min_df, and it has been set to 5. This corresponds to the minimum number of documents that should contain a feature, so we only include those words that occur in at least 5 documents. Similarly, max_df is set to 0.7. A float value is interpreted as a proportion of the documents, so 0.7 means that we include only those words that occur in at most 70% of all the documents. Words that occur in almost every document are usually not suitable for classification because they do not provide any unique information about the document.

Finally, we remove the stop words from our text since, in the case of sentiment analysis, stop words may not contain any useful information. To remove the stop words we pass the list of English stop words from the nltk.corpus module to the stop_words parameter.


The fit_transform function of the CountVectorizer class converts text documents into corresponding numeric features.
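The following standard-library sketch shows, on a three-document toy corpus, what counting words and filtering by min_df, max_df, and max_features amounts to. It is a simplified imitation under stated assumptions, not how CountVectorizer is implemented: the function name bag_of_words and the toy documents are hypothetical, tokens are split on whitespace only, and no stop words are removed.

```python
from collections import Counter


def bag_of_words(docs, max_features=None, min_df=1, max_df=1.0):
    """Return (vocabulary, count_matrix) for whitespace-tokenized docs."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Total number of occurrences of each word across the corpus.
    totals = Counter(word for tokens in tokenized for word in tokens)
    kept = [w for w in df if df[w] >= min_df and df[w] <= max_df * n_docs]
    # Keep the most frequent words, then fix an alphabetical column order.
    kept.sort(key=lambda w: (-totals[w], w))
    vocabulary = sorted(kept[:max_features])
    matrix = [[tokens.count(w) for w in vocabulary] for tokens in tokenized]
    return vocabulary, matrix


docs = ["good film good cast", "bad film", "good plot bad cast film"]
vocab, X_toy = bag_of_words(docs, max_features=3, min_df=2, max_df=0.7)
print(vocab)  # ['bad', 'cast', 'good']
print(X_toy)  # [[0, 1, 2], [1, 0, 0], [1, 1, 1]]
```

Here "film" is dropped because it appears in every document (above the max_df limit), and "plot" is dropped because it appears in only one (below min_df). Each row of the matrix is one document and each column is the count of one vocabulary word.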

Finding TFIDF

The bag of words approach works fine for converting text to numbers. However, it has one drawback: it scores a word based only on how often it occurs in a particular document. It does not take into account that the same word might also occur frequently in many other documents, in which case it says little about this one. TFIDF resolves this issue by multiplying the term frequency of a word by its inverse document frequency. TF stands for "Term Frequency" while IDF stands for "Inverse Document Frequency".

The term frequency is calculated as:

Term frequency = (Number of Occurrences of a word)/(Total words in the document)

And the Inverse Document Frequency is calculated as:

IDF(word) = Log((Total number of documents)/(Number of documents containing the word))

The TFIDF value for a word in a particular document is higher if the frequency of occurrence of that word is higher in that specific document but lower in all the other documents.
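As a worked example, the two formulas above can be applied by hand to a toy corpus. This is only a sketch of the textbook definitions: the helper names tf and idf and the three documents are made up, and the idf helper assumes the word occurs in at least one document. scikit-learn's TfidfTransformer uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math


def tf(word, tokens):
    """Occurrences of the word divided by the total words in the document."""
    return tokens.count(word) / len(tokens)


def idf(word, corpus):
    """Log of (total documents / documents containing the word)."""
    containing = sum(1 for tokens in corpus if word in tokens)
    return math.log(len(corpus) / containing)


texts = ["good film good cast", "bad film", "good plot bad cast film"]
corpus = [text.split() for text in texts]
first = corpus[0]

for word in ["good", "film"]:
    print(word, round(tf(word, first) * idf(word, corpus), 4))
# good 0.2027
# film 0.0
```

"good" makes up half of the first document and appears in two of the three documents, so it gets a positive score. "film" appears in every document, so its IDF is log(1) = 0 and it contributes nothing.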

To convert values obtained using the bag of words model into TFIDF values, execute the following script:

from sklearn.feature_extraction.text import TfidfTransformer
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()

You can also directly convert text documents into TFIDF feature values (without first converting documents to bag of words features) using the following script:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = tfidfconverter.fit_transform(documents).toarray()

Training and Testing Sets

Like any other supervised machine learning problem, we need to divide our data into training and testing sets. To do so, we will use the train_test_split utility from the sklearn.model_selection library. Execute the following script:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

The above script divides data into 20% test set and 80% training set.
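Conceptually, the split shuffles the sample indices once and then cuts them into two groups, keeping each document paired with its label. The sketch below illustrates that idea with the standard library; the function name split and the stand-in data are hypothetical, and it will not reproduce the exact rows that train_test_split selects.

```python
import random


def split(X, y, test_size=0.2, seed=0):
    """Shuffle indices with a fixed seed, then cut off the test portion."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)


# Stand-in data: 2000 "documents" (just their ids) with alternating labels.
docs_toy = list(range(2000))
labels_toy = [i % 2 for i in docs_toy]
Xtr, Xte, ytr, yte = split(docs_toy, labels_toy)
print(len(Xtr), len(Xte))  # 1600 400
```

Fixing the seed plays the same role as random_state: the same rows land in the test set on every run, so results are comparable between experiments.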

Training Text Classification Model and Predicting Sentiment

We have divided our data into training and testing sets. Now it is time to train a model. We will use the Random Forest algorithm, but you can use any other model of your choice.

To train our machine learning model using the random forest algorithm we will use RandomForestClassifier class from the sklearn.ensemble library. The fit method of this class is used to train the algorithm. We need to pass the training data and training target sets to this method. Take a look at the following script:

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train)

Finally, to predict the sentiment for the documents in our test set we can use the predict method of the RandomForestClassifier class as shown below:

y_pred = classifier.predict(X_test)

Congratulations, you have successfully trained your first text classification model and have made some predictions. Now is the time to see the performance of the model that you just created.

Evaluating the Model

To evaluate the performance of a classification model such as the one that we just trained, we can use metrics such as the confusion matrix, F1 measure, and the accuracy.

To find these values, we can use classification_report, confusion_matrix, and accuracy_score utilities from the sklearn.metrics library. Execute the following script to do so:

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

The output looks like this:

[[180  28]
 [ 30 162]]
             precision    recall  f1-score   support

          0       0.86      0.87      0.86       208
          1       0.85      0.84      0.85       192

avg / total       0.85      0.85      0.85       400

0.855

From the output, it can be seen that our model achieved an accuracy of 85.5%, which is a solid result given that we chose the parameters for CountVectorizer and for the random forest algorithm arbitrarily, without any tuning.
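The figures in the report can be recomputed from the confusion matrix alone, which helps in reading it. The sketch below uses the matrix printed above and plain Python; the helper name class_metrics is made up for illustration. In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes.

```python
cm = [[180, 28],
      [30, 162]]  # rows: true class, columns: predicted class


def class_metrics(cm, c):
    """Precision, recall, and F1 for class c of a confusion matrix."""
    tp = cm[c][c]
    precision = tp / sum(row[c] for row in cm)  # correct / all predicted as c
    recall = tp / sum(cm[c])                    # correct / all truly c
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

for c in (0, 1):
    print(c, [round(value, 2) for value in class_metrics(cm, c)])
# 0 [0.86, 0.87, 0.86]
# 1 [0.85, 0.84, 0.85]
print(accuracy)  # 0.855
```

Of the 208 negative reviews in the test set, 180 were classified correctly, and of the 192 positive reviews, 162 were, giving 342 correct predictions out of 400.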


Saving and Loading the Model

In the scripts above, training our machine learning model did not take much time. One of the reasons for the quick training is that we had a relatively small training set: we had 2000 documents, of which we used 80% (1600) for training. However, in real-world scenarios, there can be millions of documents. In such cases, it can take hours or even days (if you have slower machines) to train the algorithms. Therefore, it is recommended to save the model once it is trained.

We can save our model as a pickle object in Python. To do so, execute the following script:

with open('text_classifier', 'wb') as picklefile:
    pickle.dump(classifier, picklefile)

Once you execute the above script, you can see the text_classifier file in your working directory. We have saved our trained model and we can use it later for directly making predictions, without training.

To load the model, we can use the following code:

with open('text_classifier', 'rb') as training_model:
    model = pickle.load(training_model)

We loaded our trained model and stored it in the model variable. Let's predict the sentiment for the test set using our loaded model and see if we can get the same results. Execute the following script:

y_pred2 = model.predict(X_test)
print(confusion_matrix(y_test, y_pred2))
print(classification_report(y_test, y_pred2))
print(accuracy_score(y_test, y_pred2))

The output looks like this:

[[180  28]
 [ 30 162]]
             precision    recall  f1-score   support

          0       0.86      0.87      0.86       208
          1       0.85      0.84      0.85       192

avg / total       0.85      0.85      0.85       400

0.855

The output is identical to the one we got earlier, which shows that we successfully saved and loaded the model.
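The save and load steps are a plain pickle round trip, which the sketch below demonstrates in memory with a stand-in object. The bundle dictionary and its contents are hypothetical placeholders, not real fitted models. It also hints at a practical point: to classify new raw text later, the fitted vectorizer is needed alongside the classifier, so it can be worth pickling the two together.

```python
import io
import pickle

# Stand-ins for the fitted objects (hypothetical placeholders, not real models).
bundle = {
    "vocabulary": ["bad", "cast", "good"],
    "classifier_params": {"n_estimators": 1000, "random_state": 0},
}

buffer = io.BytesIO()        # in-memory file, used here instead of a path on disk
pickle.dump(bundle, buffer)  # serialize, as done for the classifier above
buffer.seek(0)               # rewind before reading back
restored = pickle.load(buffer)

print(restored == bundle)  # True
```

Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.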


Conclusion

Text classification is one of the most commonly used NLP tasks. In this article, we saw a simple example of how text classification can be performed in Python. We performed sentiment analysis of movie reviews.

I would advise you to try a different machine learning algorithm to see if you can improve the performance. Also, try changing the parameters of the CountVectorizer class to see if you can get any improvement.



Article information

Author: Greg O'Connell

Last Updated: 11/02/2022
