A Feature Extraction based Improved Sentiment Analysis on Apache Spark for Real-time Twitter Data

Piyush Kanungo; Hari Singh

doi:10.12694/scpe.v24i4.2343

Authors

Piyush Kanungo, CSE and IT Department, Jaypee University of Information Technology, Solan, HP, India
Hari Singh, CSE and IT Department, Jaypee University of Information Technology, Solan, HP, India

DOI:

https://doi.org/10.12694/scpe.v24i4.2343

Keywords:

machine learning, Apache Spark, Twitter, sentiment analysis, N-gram, TF-IDF

Abstract

This paper aims to improve the accuracy of sentiment analysis on Apache Spark for real-time, general Twitter data. Many studies address sentiment analysis of offline or stored Twitter data, applying classification algorithms to features extracted from pre-processed text with well-known feature extraction methods. However, few studies address sentiment analysis of real-time Twitter data, especially generic data, on big data processing platforms such as Apache Spark. This paper proposes real-time sentiment analysis of generic Twitter data on Apache Spark, using six classification algorithms with N-gram and Term Frequency – Inverse Document Frequency (TF-IDF) feature extraction on the pre-processed data. An exhaustive comparison is carried out using the Logistic Regression (LR), Multinomial Naive Bayes (MNB), Random Forest Classifier (RFC), Support Vector Machine (SVM), K-Nearest Neighbour (K-NN), and Decision Tree (DT) classification algorithms. The trigram feature extraction method performs best with LR and SVM, and the RFC results are comparable on the general tweet data considered.
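
As an illustration of the two feature extraction ideas named in the abstract, the following minimal sketch builds N-grams from tokenised text and weights them with TF-IDF. It is plain Python, not the authors' Spark pipeline. The function names `ngrams` and `tfidf`, the sample tweets, and the smoothed IDF form log((N + 1) / (df + 1)) are assumptions made for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(docs, n=1):
    """Return one {term: weight} dict per document, using n-gram terms.

    Assumption: raw term counts for TF and a smoothed IDF,
    log((N + 1) / (df + 1)), where N is the number of documents
    and df is the number of documents containing the term.
    """
    term_lists = [ngrams(doc, n) for doc in docs]
    num_docs = len(term_lists)

    # Document frequency: count each term once per document.
    df = Counter()
    for terms in term_lists:
        df.update(set(terms))

    idf = {t: math.log((num_docs + 1) / (count + 1)) for t, count in df.items()}

    weights = []
    for terms in term_lists:
        tf = Counter(terms)
        weights.append({t: tf[t] * idf[t] for t in tf})
    return weights


# Hypothetical pre-processed (tokenised, lower-cased) tweets.
tweets = [
    ["i", "love", "this", "phone"],
    ["i", "hate", "this", "phone"],
]
trigram_weights = tfidf(tweets, n=3)
```

With `n=3`, each tweet is represented by its trigrams, so a phrase such as "i love this" becomes a single feature. Terms that occur in every document receive a weight of zero under this IDF form, which is why word order and distinctive phrases dominate the resulting feature vectors.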
