Sentiment Analysis Using Python

Sentiment analysis involves analyzing the opinions about a product or service expressed in the form of a text and categorizing those opinions to draw meaningful insights. Generally, these opinions are categorized into positive, negative, or neutral.

[Image: Overview]

In this blog, we’ll build a 2-way polarity (up or down) classification system for stock prices based on news headlines, without using NLTK’s (Natural Language Toolkit) built-in sentiment analysis engine. We will extract Bag-of-Words features and train a Logistic Regression classifier, a Random Forest classifier, and a Multinomial Naive Bayes classifier to see which performs best. We will also create our own preprocessing module to handle the raw news headlines.

Data Used

  • The news headlines dataset has 4,100 rows and 27 columns.
  • Each row has the date of the headlines, a label (0 – Stock price goes down or stays the same, 1 – Stock price goes up), and the top 25 headlines.
  • Rows labeled 1 (stock price goes up) are slightly more common than rows labeled 0 in the given dataset.

[Image: Stock Sentiments Graph]

Data Cleansing and Preprocessing

First, we clean the data by filtering out null values. We then preprocess the headlines so that they are easier to work with and ready for feature extraction and for training the classifiers.

After cleaning the data, we’re going to split the data into train and test sets and extract the labels from both sets.

[Image: Train and Test Sets]

Next, we extract the Y labels from the train and test sets.

[Image: Split Data]
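
The cleaning and splitting steps above can be sketched as follows. This is a minimal illustration, not the exact code from the screenshots: the tiny DataFrame, the column names (Date, Label, Top1 to Top3), and the date-based cutoff are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical miniature of the dataset: Date, Label, and headline columns
# (the real data has 25 headline columns; three are enough to illustrate).
df = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-02", "2015-01-05"],
    "Label": [1, 0, 1, 0],
    "Top1": ["Markets rally on jobs data", "Oil slump deepens",
             None, "Banks face new probe"],
    "Top2": ["Tech shares climb", "Factory output falls",
             "Retail sales beat forecasts", "Exports drop again"],
    "Top3": ["Central bank holds rates", "Currency hits new low",
             "Housing starts rise", "Strike halts ports"],
})

# Filter out rows that contain null values.
df = df.dropna().reset_index(drop=True)

# Assumed date-based split: older rows for training, newer rows for testing.
df["Date"] = pd.to_datetime(df["Date"])
cutoff = pd.Timestamp("2015-01-01")  # hypothetical cutoff date
train = df[df["Date"] < cutoff]
test = df[df["Date"] >= cutoff]

# Extract the Y labels from both sets.
y_train = train["Label"].to_numpy()
y_test = test["Label"].to_numpy()

print(len(train), len(test))  # 2 1
```

A random split would work as well; a date-based cutoff is used here only because the rows are dated.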

Then, for the preprocessing, we first install the NLTK libraries needed to process the data sets.

[Image: Install NLTK Libraries]

Next, we remove special characters and punctuation marks, rename the columns, and make all words lowercase. Finally, we combine all the headline columns into a single text column.

[Image: Remove Special Characters]
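
As a rough sketch of this cleaning step, the function below works on one row of headlines in plain Python. The function name `clean_row` and the sample headlines are hypothetical, and the sketch assumes that only letters are kept (so digits are dropped together with punctuation).

```python
import re

def clean_row(headlines):
    """Replace non-letters with spaces, lowercase, and join into one string."""
    cleaned = []
    for text in headlines:
        letters_only = re.sub(r"[^a-zA-Z]", " ", text)
        # Collapse the repeated whitespace left behind by removed characters.
        cleaned.append(" ".join(letters_only.lower().split()))
    # Combine all headlines of the row into a single string.
    return " ".join(cleaned)

row = ["Stocks SURGE 5% after Fed's decision!", "Oil prices: down again?"]
print(clean_row(row))
# stocks surge after fed s decision oil prices down again
```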

After the above steps, the final output will be as follows:

[Image: Final Output]

Now, we create a word corpus by tokenizing the preprocessed headlines, removing stop words, stemming the remaining words, and joining the stemmed words back into one string per row.

[Image: Create Corpus for Test Data Set]

Similarly, we create a corpus for the test data set.

[Image: Create Word Corpus]
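
The corpus-building step might look like the sketch below. It uses NLTK’s `PorterStemmer`, which needs no extra downloads. The short stop-word set and the whitespace tokenization are simplifying assumptions for the example; in practice NLTK’s full English stop-word list (available after `nltk.download("stopwords")`) and an NLTK tokenizer would be the more usual choices.

```python
from nltk.stem.porter import PorterStemmer

# Small illustrative stop-word list (an assumption); NLTK's English list
# is much larger.
STOP_WORDS = {"the", "a", "an", "of", "on", "in", "to", "and", "after", "is", "are"}

stemmer = PorterStemmer()

def build_corpus(headlines):
    """Tokenize, drop stop words, stem, and re-join each cleaned headline row."""
    corpus = []
    for text in headlines:
        tokens = text.split()  # whitespace tokenization of already-cleaned text
        stems = [stemmer.stem(token) for token in tokens if token not in STOP_WORDS]
        corpus.append(" ".join(stems))
    return corpus

train_corpus = build_corpus([
    "stocks surge after the fed decision",
    "oil prices are falling",
])
print(train_corpus)
```

The same `build_corpus` function can be called on the test headlines, so both sets go through identical processing.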

From this corpus, we can create word clouds of down words (negative impact on the stock market) and up words (positive impact).

[Image: Word Cloud 1]
[Image: Word Cloud 2]

From the available corpus, we extract features using a Bag-of-Words model, which represents each row of headlines as a vector of word counts that machine learning algorithms can work with.

[Image: Bag of Words]
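
One common way to build Bag-of-Words features in Python is scikit-learn’s `CountVectorizer`; whether the original code used it is an assumption. The tiny corpora below are made up for illustration. The vocabulary is learned from the training corpus only and then reused to transform the test corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stemmed corpora, one string per row of headlines.
train_corpus = ["stock surg fed decis", "oil price fall", "stock fall"]
test_corpus = ["oil stock surg"]

# Unigram counts; fit on the training data, then apply to the test data.
vectorizer = CountVectorizer(ngram_range=(1, 1))
X_train = vectorizer.fit_transform(train_corpus)
X_test = vectorizer.transform(test_corpus)

print(X_train.shape)  # (3, 7): 3 rows, 7 distinct words
print(X_test.shape)   # (1, 7)
```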

Model Building

Now, we will run our data through a Logistic Regression classifier, a Random Forest classifier, and a Multinomial Naive Bayes classifier. We also calculate performance measures such as accuracy, precision, and recall for each algorithm to select the model with the best results.

Logistic Regression

[Image: Random Forest Classification]

A confusion matrix is a performance measurement for classification algorithms. It is a table that shows the counts of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

  • Accuracy (all correct / all) = (TP + TN) / (TP + TN + FP + FN)
  • Precision (true positives / predicted positives) = TP / (TP + FP)
  • Recall (true positives / all actual positives) = TP / (TP + FN)

[Image: Making the Confusion Matrix 1]
[Image: Confusion Matrix for Logistic Regression Algorithm]
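
To make the three formulas concrete, here is a small plain-Python sketch that counts TP, FP, TN, and FN for binary labels and derives the measures from them. The function names and the sample labels are hypothetical; scikit-learn’s `confusion_matrix` and related metric functions provide the same information.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels where 1 is the positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def safe_divide(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

def scores(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": safe_divide(tp + tn, tp + tn + fp + fn),
        "precision": safe_divide(tp, tp + fp),
        "recall": safe_divide(tp, tp + fn),
    }

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 1, 3, 1)
print(scores(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75}
```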

Logistic regression gave us an accuracy of 86%, which is a great start. We continue with other algorithms to see if they give better results.

Random Forest Classification

[Image: Random Forest Classification 1]

We create a confusion matrix in the same way as for logistic regression.

[Image: Making the Confusion Matrix]
[Image: Confusion Matrix for Random Forest Algorithm]

The Random Forest classifier gave an accuracy score of 84%, which is lower than that of Logistic Regression.

Multinomial Naive Bayes

[Image: Random Forest Classification 2]
[Image: Making the Confusion Matrix 3]
[Image: Confusion Matrix for Multinomial Naive Bayes]
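
The three classifiers can be trained and scored in one loop, as sketched below with scikit-learn. The toy corpora, labels, and hyperparameters (for example `n_estimators=100`) are assumptions for illustration, so the printed accuracies have no relation to the 86% and 84% figures reported in this blog.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stemmed corpora and labels (1 = price up, 0 = down or same).
train_corpus = ["stock surg profit rise", "market ralli gain", "oil price fall",
                "bank loss probe", "export grow record", "strike halt port"]
y_train = [1, 1, 0, 0, 1, 0]
test_corpus = ["profit gain record", "loss fall strike"]
y_test = [1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_corpus)
X_test = vectorizer.transform(test_corpus)

models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Multinomial Naive Bayes": MultinomialNB(),
}

# Fit each model on the same features and record its test accuracy.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in accuracy.items():
    print(f"{name}: {score:.2f}")
```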

To recap, we created a Logistic Regression classifier and its confusion matrix. We also created a Random Forest classifier and a Multinomial Naive Bayes classifier to see which gives the highest accuracy.

The table below lists the different classifiers and their accuracy scores.

Classifier                        Accuracy Score
Logistic Regression classifier    86%
Random Forest classifier          84%
Naive Bayes classifier            84%

As we can see, the Logistic Regression classifier has the highest accuracy of the three. So, we use the Logistic Regression classifier for the predictions.

Predictions

We can build a function to put all the steps in one place and test some of the headlines to see if the stock price will go up or down.

[Image: Predictions Function]
[Image: Generating Random Integer]
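
A prediction function along these lines could look like the sketch below, which chains cleaning, Bag-of-Words features, and Logistic Regression in a scikit-learn pipeline and then picks a random sample headline to test. The function names, training headlines, and sample headlines are all hypothetical, and stop-word removal and stemming are left out to keep the example short.

```python
import random
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def preprocess(text):
    """Keep letters only, lowercase, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-zA-Z]", " ", text).lower().split())

# Hypothetical training data (1 = price up, 0 = down or unchanged).
train_headlines = ["Profits rise as markets rally", "Exports hit record high",
                   "Oil prices fall sharply", "Bank reports heavy losses"]
y_train = [1, 1, 0, 0]

# Vectorizer and classifier in one object, so prediction repeats every step.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit([preprocess(h) for h in train_headlines], y_train)

def predict_stock_movement(headline):
    """Return a readable label for a single raw headline."""
    label = model.predict([preprocess(headline)])[0]
    return "up" if label == 1 else "down or unchanged"

# Pick a random sample headline, as one would pick a random test row.
sample_headlines = ["Markets rally on record profits", "Heavy losses as prices fall"]
index = random.randint(0, len(sample_headlines) - 1)
print(sample_headlines[index], "->", predict_stock_movement(sample_headlines[index]))
```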

Example 1:

[Image: Predicting Values 1]

Example 2:

[Image: Predicting Values 2]

Drawing insights from social media posts and other sources is important for businesses in this information age because large volumes of text are generated on the Internet every second. In this blog, we have covered what sentiment analysis is and how we can analyze news headlines using Python. These are only some basic ways to perform sentiment analysis, and we can explore more models on the same data.

If you have any questions about this blog or need help with sentiment analysis and other machine learning services, please contact us.
