Sentiment analysis involves analyzing the opinions about a product or service expressed in the form of a text and categorizing those opinions to draw meaningful insights. Generally, these opinions are categorized into positive, negative, or neutral.
Sentiment Analysis Using Python Overview
In this blog, we’ll build a 2-way polarity (positive or negative) classification system that predicts stock price movement from news headlines, without using NLTK’s (Natural Language Toolkit) built-in sentiment analysis engine. We will extract Bag-of-Words features and compare a Logistic Regression classifier, a Random Forest classifier, and a Multinomial Naive Bayes classifier to see which performs best. We will also create our own preprocessing module to handle raw news headlines.
Data Used
- The news headlines dataset has 4,100 rows and 27 columns
- Each row has the date of headlines, label (0 – Stock price goes down or stays the same, 1 – Stock price goes up), and top 25 headlines.
- Positive stock sentiment (label 1) is slightly more common than negative sentiment in the given dataset
data:image/s3,"s3://crabby-images/47a8c/47a8c3214af0a83e2fafe8b58c11d38d0f5e8a6a" alt="Sentiment Analysis Using Python Stock Sentiments Graph"
Data Cleansing and Preprocessing
First, we clean the data by filtering out null values. We then preprocess the headlines so that they are ready for feature extraction and for training the classifiers.
After cleaning the data, we’re going to split the data into train and test sets and extract the labels from both sets.
data:image/s3,"s3://crabby-images/1f49f/1f49f192ab4fae0c8855aef8f9b103172c03d1fe" alt="Sentiment Analysis Using Python Train and Test Sets"
Split the data sets to extract Y labels.
data:image/s3,"s3://crabby-images/c6bf2/c6bf27b65a984f8a55817dc1923273235c1e29e8" alt="Sentiment Analysis Using Python Split Data"
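The cleaning and splitting steps above can be sketched roughly as follows. This is a minimal illustration rather than the exact code used here: the tiny DataFrame, the column names (`Date`, `Label`, `Top1`, `Top2`), and the cutoff date are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the headlines dataset (hypothetical values and column names).
df = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-02", "2015-01-05"],
    "Label": [1, 0, 1, 0],
    "Top1": ["Markets rally on strong earnings", None,
             "Oil prices slide again", "Central bank holds rates"],
    "Top2": ["Tech shares climb", "Banks report losses",
             "Retail sales beat forecasts", "Exports fall sharply"],
})

# Filter out rows that contain null values.
df = df.dropna().reset_index(drop=True)

# Chronological split: earlier rows for training, later rows for testing.
cutoff = "2015-01-01"  # assumed cutoff date, for illustration only
train = df[df["Date"] < cutoff]
test = df[df["Date"] >= cutoff]

# Extract the Y labels from both sets.
y_train = train["Label"].to_numpy()
y_test = test["Label"].to_numpy()
```

Splitting by date rather than at random keeps future headlines out of the training set, which is one reasonable choice for time-ordered data.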
Then, for preprocessing, we first install the NLTK libraries needed to process the data sets.
data:image/s3,"s3://crabby-images/c22f5/c22f5a5db6d5d11a5b5ce7a9d756f714d549f9ec" alt="Sentiment Analysis Using Python Install NLTK Libraries"
Next, we remove special characters and punctuation marks, rename columns and make all words lowercase. Finally, we combine all the columns into one.
data:image/s3,"s3://crabby-images/beb27/beb27803f00ea475d1ea4435d5bc2cdb40ac441e" alt="Sentiment Analysis Using Python Remove Special Characters"
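As a rough sketch of this cleaning step, the helper below strips special characters and punctuation, lowercases the text, and merges one row of headlines into a single string. The function name `clean_headlines` and the sample row are hypothetical. Note that this particular regular expression also drops digits, which is an assumption on our part.

```python
import re

def clean_headlines(headlines):
    """Clean one row of headlines and merge them into a single string."""
    cleaned = []
    for text in headlines:
        # Replace anything that is not a letter with a space.
        text = re.sub(r"[^a-zA-Z]", " ", text)
        # Lowercase and collapse repeated whitespace.
        cleaned.append(" ".join(text.lower().split()))
    # Combine all headline columns into one string.
    return " ".join(cleaned)

row = ["Stocks RALLY, up 3%!", "Oil prices: a new low?"]
combined = clean_headlines(row)
print(combined)
```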
After the above steps, the final output will be as follows:
data:image/s3,"s3://crabby-images/e6147/e61477c32a7a55838975b411c939cf19f58411d0" alt="Sentiment Analysis Using Python Final Output"
Now, we create a word corpus by tokenizing the preprocessed headlines, removing stop words and joining the stemmed words.
data:image/s3,"s3://crabby-images/1ee64/1ee6467703feac6b950e8a88f7784c76a8715b4b" alt="Sentiment Analysis Using Python Create Corpus for Test Data Set"
Similarly, we can create a corpus for the test data set as well.
data:image/s3,"s3://crabby-images/8e90e/8e90e4ee89251607a16e9baa30182d0911a1ecdc" alt="Sentiment Analysis Using Python Create Word Corpus"
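A minimal sketch of the corpus-building step, assuming NLTK's `PorterStemmer` and plain whitespace tokenization. The short stop word set is a hand-written stand-in for illustration; NLTK's own stop word list is much larger and requires a separate download. The helper name `build_corpus` is hypothetical.

```python
from nltk.stem.porter import PorterStemmer

# Small illustrative stop word set (an assumption, not NLTK's full list).
STOP_WORDS = {"a", "an", "the", "on", "in", "of", "and", "are", "is", "to"}

stemmer = PorterStemmer()

def build_corpus(headlines):
    """Tokenize, drop stop words, stem, and rejoin each preprocessed headline."""
    corpus = []
    for text in headlines:
        tokens = text.split()  # simple whitespace tokenization
        stems = [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]
        corpus.append(" ".join(stems))
    return corpus

train_corpus = build_corpus(["stocks are falling on the markets"])
print(train_corpus)
```

The same function can be called on the test headlines to build the test corpus.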
From this corpus, we can create a word cloud of down words (negative impact on stock market) and up words (positive impact).
data:image/s3,"s3://crabby-images/e2673/e2673b41c428567dff61d3931f5e91c26c073fe0" alt="Sentiment Analysis Using Python Word Cloud 1"
data:image/s3,"s3://crabby-images/b4003/b4003a0c7127b36c46ff4527137661194911a364" alt="Sentiment Analysis Using Python Word Cloud 2"
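A word cloud is drawn from word frequencies, so the underlying counts can be sketched with the standard library alone. The corpus, labels, and helper name `top_words` below are made up for illustration; rendering the cloud itself would need a plotting library, which is not shown.

```python
from collections import Counter

def top_words(corpus, labels, target_label, n=3):
    """Return the n most frequent words among headlines with target_label."""
    counts = Counter()
    for text, label in zip(corpus, labels):
        if label == target_label:
            counts.update(text.split())
    return counts.most_common(n)

# Hypothetical stemmed corpus with its labels (0 = down, 1 = up).
corpus = ["market fall loss", "profit rise market", "loss fall", "profit growth"]
labels = [0, 1, 0, 1]

down_words = top_words(corpus, labels, target_label=0)
up_words = top_words(corpus, labels, target_label=1)
```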
With the corpus available, we extract features using a Bag-of-Words model, which turns each headline into a vector of word counts that machine learning algorithms can work with.
data:image/s3,"s3://crabby-images/ab2d0/ab2d0277bae6fcc02980592e7e8795a0fcc658dc" alt="Sentiment Analysis Using Python Bag of Words"
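One common way to build Bag-of-Words features in Python is scikit-learn's `CountVectorizer`. The sketch below assumes that library and uses toy corpora; the vectorizer settings used for the results in this blog are not stated, so the defaults here are an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpora for illustration only.
train_corpus = ["stock fall market", "profit rise market"]
test_corpus = ["market rise"]

# Learn the vocabulary from the training corpus only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_corpus)

# Reuse the same vocabulary for the test corpus.
X_test = vectorizer.transform(test_corpus)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(X_train.toarray())
```

Fitting on the training corpus and only transforming the test corpus keeps both feature matrices aligned to the same columns.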
Model Building
Now, we will run our data through a Logistic Regression classifier, a Random Forest classifier, and a Multinomial Naive Bayes classifier. We also calculate performance measures such as accuracy, precision, and recall for each algorithm to select the model with the best results.
Logistic Regression
data:image/s3,"s3://crabby-images/68851/688516eb1fa9241b9ad5581e6dd725988372b4c6" alt="Sentiment Analysis Using Python Random Forest Classification"
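A minimal sketch of fitting a Logistic Regression classifier on Bag-of-Words counts with scikit-learn. The toy corpus, labels, and default hyperparameters are assumptions for illustration and will not reproduce the accuracy reported in this blog.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stemmed corpus and labels (1 = up, 0 = down), for illustration only.
train_corpus = ["profit rise growth", "loss fall debt",
                "growth profit record", "debt loss crash"]
y_train = [1, 0, 1, 0]
test_corpus = ["record profit", "crash debt"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_corpus)
X_test = vectorizer.transform(test_corpus)

# Fit the classifier and predict labels for the test headlines.
lr_classifier = LogisticRegression()
lr_classifier.fit(X_train, y_train)
lr_pred = lr_classifier.predict(X_test)
print(lr_pred)
```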
A confusion matrix is a performance measurement for classification algorithms. It tabulates the counts of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN), from which the following measures are derived:
- Accuracy (all correct / all) = (TP + TN) / (TP + TN + FP + FN)
- Precision (true positives / predicted positives) = TP / (TP + FP)
- Recall (true positives / all actual positives) = TP / (TP + FN)
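The three measures above translate directly into a small helper, with each numerator and denominator grouped explicitly. The counts passed in are hypothetical and are not taken from the models in this blog.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, and recall from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Hypothetical counts, for illustration only.
accuracy, precision, recall = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```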
data:image/s3,"s3://crabby-images/1f016/1f01655b50a485d2edb4a3f6dc10dad92dba0add" alt="Sentiment Analysis Using Python Making the Confusion Matrix 1"
data:image/s3,"s3://crabby-images/abf83/abf83794ce403c773b15cc6f3b11fa6a988bd954" alt="Sentiment Analysis Using Python Confusion Matrix for Logistic Regression Algorithm"
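With scikit-learn, the confusion matrix and the derived scores can be computed as in the sketch below. The label and prediction lists are invented for illustration and are unrelated to the results shown here.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical true labels and predictions (1 = up, 0 = down).
y_test = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, rows are actual classes and columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(cm)
print(accuracy, precision, recall)
```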
Logistic Regression gave us an accuracy of 86%, which is a great start. Next, we try the other algorithms to see if they give better results.
Random Forest Classification
data:image/s3,"s3://crabby-images/3bbce/3bbce5455807a4e9674b8daab8201b61dc75a3d7" alt="Sentiment Analysis Using Python Random Forest Classification 1"
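A minimal sketch of the Random Forest step using scikit-learn's `RandomForestClassifier`. The toy data and the hyperparameters (`n_estimators=200`, `criterion="entropy"`, `random_state=0`) are assumptions chosen for the example, not the settings behind the reported accuracy.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy stemmed corpus and labels (1 = up, 0 = down), for illustration only.
train_corpus = ["profit rise growth", "loss fall debt",
                "growth profit record", "debt loss crash"]
y_train = [1, 0, 1, 0]
test_corpus = ["record profit", "crash debt"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_corpus)
X_test = vectorizer.transform(test_corpus)

# Hyperparameters are assumed values for this sketch.
rf_classifier = RandomForestClassifier(n_estimators=200, criterion="entropy",
                                       random_state=0)
rf_classifier.fit(X_train, y_train)
rf_pred = rf_classifier.predict(X_test)
print(rf_pred)
```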
Create a confusion matrix similar to logistic regression.
data:image/s3,"s3://crabby-images/2addb/2addbf19191f7fa9963ebe552581ccda745f1209" alt="Sentiment Analysis Using Python Making the Confusion Matrix"
data:image/s3,"s3://crabby-images/75e2b/75e2bc9f8be9f2fb150842fae81f789c4fcd8e8a" alt="Sentiment Analysis Using Python Confusion Matrix for Random Forest Algorithm"
The Random Forest classifier gave an accuracy score of 84%, which is lower than that of Logistic Regression.
Multinomial Naive Bayes
data:image/s3,"s3://crabby-images/80d27/80d2700de3c1c496f1f9ad698319170e08f44789" alt="Sentiment Analysis Using Python Random Forest Classification 2"
data:image/s3,"s3://crabby-images/c7ca7/c7ca79ff8e7b1cb59c846facba3e5ab7db3a1444" alt="Sentiment Analysis Using Python Making the Confusion Matrix 3"
data:image/s3,"s3://crabby-images/ccc05/ccc05db47788bbc203e42befb231f87f728b0b6f" alt="Sentiment Analysis Using Python Confusion Matrix for Multinomial Naive Bayes"
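A minimal sketch of fitting a Multinomial Naive Bayes classifier on Bag-of-Words counts, assuming scikit-learn's `MultinomialNB` with its default smoothing. The toy corpus and labels are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stemmed corpus and labels (1 = up, 0 = down), for illustration only.
train_corpus = ["profit rise growth", "loss fall debt",
                "growth profit record", "debt loss crash"]
y_train = [1, 0, 1, 0]
test_corpus = ["record profit", "crash debt"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_corpus)
X_test = vectorizer.transform(test_corpus)

# Multinomial Naive Bayes works directly on non-negative word counts.
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
nb_pred = nb_classifier.predict(X_test)
print(nb_pred)
```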
To recap what just happened, we created a Logistic regression classifier and its Confusion Matrix. We also created a Random Forest classifier and Naive Bayes classifier to see which gives the maximum accuracy.
The table below lists the different classifiers and their accuracy scores.
| Classifier | Accuracy Score |
| --- | --- |
| Logistic Regression classifier | 86% |
| Random Forest classifier | 84% |
| Naive Bayes classifier | 84% |
As we can see, the Logistic Regression classifier has the highest accuracy of the three, so we use it for the predictions.
Predictions
We can build a function to put all the steps in one place and test some of the headlines to see if the stock price will go up or down.
data:image/s3,"s3://crabby-images/9d722/9d722ff62e16c22abceebe5723b287e7773758ef" alt="Sentiment Analysis Using Python Predictions Function"
data:image/s3,"s3://crabby-images/21a5b/21a5bdf1ed2b42d881ccaecec60b7dda961012ee" alt="Sentiment Analysis Using Python Generating Random Integer"
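A prediction function of this kind might look like the sketch below, which chains cleaning, Bag-of-Words vectorization, and a fitted Logistic Regression classifier. The function name `stock_prediction`, the toy training data, and the sample headlines are hypothetical. Stop word removal and stemming are left out here to keep the example short.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (1 = up, 0 = down), for illustration only.
train_corpus = ["profit rise growth", "loss fall debt",
                "growth profit record", "debt loss crash"]
y_train = [1, 0, 1, 0]

vectorizer = CountVectorizer()
lr_classifier = LogisticRegression()
lr_classifier.fit(vectorizer.fit_transform(train_corpus), y_train)

def stock_prediction(headline):
    """Return 'up' or 'down' for a raw news headline."""
    # Same cleaning idea as in preprocessing: letters only, lowercase.
    text = re.sub(r"[^a-zA-Z]", " ", headline).lower()
    features = vectorizer.transform([text])
    return "up" if lr_classifier.predict(features)[0] == 1 else "down"

print(stock_prediction("Record PROFIT!"))
print(stock_prediction("Debt crisis leads to crash"))
```

Keeping the fitted vectorizer and classifier together ensures new headlines are mapped onto the same vocabulary the model was trained on.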
Example 1:
data:image/s3,"s3://crabby-images/8c336/8c3369864a1a609827de248c8528b60922b87eaa" alt="Sentiment Analysis Using Python Predicting Values 1"
Example 2:
data:image/s3,"s3://crabby-images/e84ea/e84ea862e53ab0cf5630d43c73709fde325187b9" alt="Sentiment Analysis Using Python Predicting Values 2"
Drawing insights from social media posts and other sources is imperative for businesses in this information age because an abundance of information is generated in mere fractions of seconds on the Internet. In this blog, we have covered what Sentiment Analysis is, and how we can analyze given data using Python. However, these were just some basic ways to perform sentiment analysis. We can explore more models to use on our data.
If you have any questions about this blog or need help with sentiment analysis and other machine learning services, please contact us.