In the 2017 study “Evaluation of classification algorithms for banking customer behavior under Apache Spark data processing system”, Etaiwi Wael et al. compared the Naive Bayes and SVM algorithms using Apache Spark [1]. The results showed that the Naive Bayes predictive approach was more efficient than SVM. The data consisted of customers’ personal information and behavior data from Santander Bank, available from the website kaggle.com. To evaluate the classification algorithms, Etaiwi Wael et al. used the metrics precision, recall, and F-measure.
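The three metrics named above can be illustrated with a minimal sketch. The function name `precision_recall_f1` and the toy labels are hypothetical and are not taken from the cited study; the sketch assumes binary labels with 1 as the positive class.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F-measure (F1) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Toy example (hypothetical labels): 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Precision penalizes false positives and recall penalizes false negatives; the F-measure combines the two as a harmonic mean, so it is high only when both are high.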
Pang et al. (2002) compared several classification algorithms on movie reviews, offering insight into sentiment analysis and opinion mining. They evaluated the performance of Naive Bayes, Maximum Entropy, and Support Vector Machines in the specific domain of movie reviews and obtained an accuracy slightly above 80% [7].
The same techniques were also used by Kharde and Sonawane (2016) to perform sentiment analysis on Twitter data. Again, the results showed that the SVM algorithm had the best performance [8].
Catal and Nangir (2017) proposed a new sentiment classification technique based on a Vote ensemble classifier. Three individual classifiers, namely Bagging, Naive Bayes, and Support Vector Machines (SVM), were used for the Turkish sentiment classification problem. The results showed that the Naive Bayes and SVM algorithms performed well, with an average accuracy above 81% [9].
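A vote ensemble over three base classifiers can be sketched as follows. The text does not state the exact voting rule used in the cited work, so this sketch assumes simple hard (majority) voting; the function name `majority_vote` and the toy predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one list by majority vote.

    `predictions` is a list of equal-length label lists, one per classifier.
    On a tie, the label seen first (from the earliest classifier) wins.
    """
    voted = []
    for labels in zip(*predictions):  # labels assigned to one item by each classifier
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical outputs of three base classifiers on four reviews.
bagging_pred = ["pos", "neg", "pos", "neg"]
nb_pred = ["pos", "pos", "neg", "neg"]
svm_pred = ["neg", "pos", "pos", "neg"]

ensemble_pred = majority_vote([bagging_pred, nb_pred, svm_pred])
```

With an odd number of classifiers and two classes, a majority always exists, so the tie rule only matters for multi-class problems or an even number of voters.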
In the 2017 paper “Microblog Sentiment Classification using Parallel SVM in Apache Spark”, Bo Yan et al. classified sentiments on a microblog using a parallel SVM [10]. They also sought to increase the execution speed of SVM, which is usually constrained on large amounts of data, by using the RBF kernel function on Spark. In addition, they improved accuracy by attaching comments to microblogs, evolving the feature space, and tuning parameters. With these methods, the parallel SVM on Spark performed better than LIBSVM.
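For reference, the RBF (Gaussian) kernel mentioned above is commonly defined as K(x, y) = exp(-gamma * ||x - y||^2). The following is a minimal sketch of that standard definition, not the parallel implementation from the cited paper; the function name `rbf_kernel` and the default `gamma` value are assumptions chosen for illustration.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Standard RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical points have similarity 1; similarity decays toward 0 with distance.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
near = rbf_kernel([0.0, 0.0], [1.0, 1.0])
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])
```

The parameter `gamma` controls how quickly similarity falls off with distance, which is one reason kernel SVMs usually require parameter tuning.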
Srivastava D.K and Bhambhu L., in the Journal of Theoretical and Applied Information Technology, studied data classification using support vector machines [11]. They used four datasets of different sizes: Diabetes, Heart, Satellite, and Shuttle. Their results compared SVM using the RBF kernel function against the rule-based classifier of RSES. In terms of total execution time for prediction, SVM took longer than RSES. In terms of accuracy, SVM was better than RSES, and they concluded that the larger the amount of data classified, the higher the prediction accuracy.
In their 2011 paper “Naive Bayes Classification Algorithm Based on Small Sample Set”, Huang Y. and Li Lei studied the Naive Bayes classification algorithm based on the Poisson distribution model [12]. They showed that classification accuracy remained high even with small sample data.
The next study is by Baltas Alexandros et al. (2017), entitled “An Apache Spark Implementation for Sentiment Analysis on Twitter Data”. They combined machine learning methodologies with natural language processing techniques, using the Apache Spark machine learning library (MLlib) and classification algorithms (binary and ternary classification) to classify microblog posts, for example as positive or negative. The results showed that Naive Bayes was the best algorithm [13].
In 2016, there were several studies on Apache Spark for Big Data machine learning and sentiment analysis. Fu Jian et al., in their research “Spark-A Big Processing Platform for Machine Learning”, analyzed Spark’s primary framework and core technologies and ran a machine learning instance on it [14]. Compared with Hadoop, Spark had better computing ability.
Salloum Salman et al. (2016), in their research entitled “Big data analytics on Apache Spark”, explained how Apache Spark is built [15]. They gave a technical review of Apache Spark focusing on its key components, abstractions, and features, in order to show what Apache Spark offers for designing and implementing Big Data algorithms and machine learning pipelines.
Several related studies also support this research, which uses datasets based on product reviews from the Google Play Store. In the dissertation “Mobile App Analytics & Sentiment Analysis of Customer Reviews”, Calikus Ece (2015) proposed a prototype system that displays a dashboard for iOS and Android app analytics on a single platform. The back end used mobile data mining techniques and applied a classification-based sentiment analysis model. Based on the background research, supervised machine learning techniques were chosen for sentiment classification. Evaluation of the proposed solution resulted in four major contributions. The first and main contribution was showing that a supervised machine learning approach to sentiment analysis is applicable to the classification of mobile app reviews, with 88.3% accuracy. Moreover, it showed that the best-performing algorithm for sentiment classification was the Multinomial Naïve Bayes algorithm [16].