PREDICTING THE CRICKET MATCH OUTCOME USING CROWD OPINION ON SOCIAL NETWORKS: A COMPARATIVE STUDY OF MACHINE LEARNING

With the emergence of the internet, people have started using Twitter, Facebook, YouTube and Instagram to share information, ideas, and opinions about global events. Twitter embodies the wisdom-of-the-crowd concept, which researchers use to predict the results of different sports from public tweets. Cricket is so popular that 1.5 million people followed Indian Premier League 2014 and posted five million tweets. The Cricket World Cup has an even bigger fan base. The complex rules and varying natural conditions make it difficult to predict match results, so opinion mining and sentiment analysis are used to predict them.
We collected tweets written in English for 109 matches and extracted three features: volume of tweets, aggregated sentiment and predictions about scores. Finally, three classifiers, Support Vector Machine (SVM), Naive Bayes (NB) and Logistic Regression (LR), are used to estimate the results. Researchers around the world have already made such predictions using data from social media. They used different techniques, regression models, statistical information, classifiers, natural conditions affecting match results, features of players and different algorithms to predict match outcomes.
In contrast with these techniques, we extracted meaningful features from the Twitter data to predict the results. We also compare the results of the SVM, NB and LR classifiers to choose the most suitable one. The procedure is divided into a training phase and a testing phase. In the training phase, we first collected 2.3 million tweets for CWC 2015 and 1.7 million tweets for IPL 2014.
Then we extracted three different features from these tweets: tweet volume, aggregated fan sentiment and score prediction. Tweet volume is useful for ranking teams; to compute this feature, we divided the number of tweets about a team by the total number of tweets for the match played by that team. Fan sentiment from Twitter data has been used effectively to predict outcomes in many different fields. We classified this linguistic data into positive and negative classes and divided the number of positive tweets by the total number of tweets posted before the match. To compute the score prediction feature, we divided the total number of tweets containing a predicted score by the number of matches played by the team.
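The three features described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Tweet` record, its field names, the function names and the sample data are all hypothetical. It assumes sentiment labels have already been assigned upstream, that all tweets passed in were posted before the match, and that the sentiment ratio is taken over the team's own tweets.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    team: str             # team the tweet is about (hypothetical field)
    sentiment: str        # "positive" or "negative", labelled upstream
    predicts_score: bool  # True if the tweet contains a predicted score


def tweet_volume(tweets, team):
    """Tweets about `team` divided by all tweets for the match."""
    if not tweets:
        return 0.0
    return sum(t.team == team for t in tweets) / len(tweets)


def positive_sentiment_ratio(tweets, team):
    """Positive tweets about `team` divided by all tweets about `team`."""
    team_tweets = [t for t in tweets if t.team == team]
    if not team_tweets:
        return 0.0
    positive = sum(t.sentiment == "positive" for t in team_tweets)
    return positive / len(team_tweets)


def score_prediction_rate(tweets, team, matches_played):
    """Tweets about `team` containing a predicted score, per match played."""
    if matches_played <= 0:
        raise ValueError("matches_played must be positive")
    with_score = sum(t.team == team and t.predicts_score for t in tweets)
    return with_score / matches_played


# Hypothetical pre-match tweets for a single match (illustrative data only).
sample = [
    Tweet("IND", "positive", True),
    Tweet("IND", "negative", False),
    Tweet("AUS", "positive", False),
    Tweet("IND", "positive", True),
]

volume = tweet_volume(sample, "IND")
sentiment = positive_sentiment_ratio(sample, "IND")
score_rate = score_prediction_rate(sample, "IND", matches_played=2)
```

Each function returns a single number per team, so the three values can be placed side by side as one feature vector per team and match before being passed to a classifier.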
After calculating these features, we used the SVM, NB and LR classifiers to evaluate their effectiveness. SVM mainly focuses on finding a hypothesis that minimizes the true error. The NB classifier is used to estimate the most suitable class for each document. LR is used to predict events on the basis of known facts using statistical analysis. We obtained three hypotheses from these classifiers. In the evaluation task, we used precision, recall and F-measure to evaluate these three hypotheses.
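For reference, the three evaluation measures can be computed from actual and predicted match outcomes as sketched below. The function name, the "win"/"loss" labels and the sample data are illustrative assumptions. The sketch scores a single class, whereas a tool such as WEKA may report averages over all classes, so its figures need not match a single-class calculation.

```python
def precision_recall_f1(actual, predicted, positive="win"):
    """Precision, recall and F-measure for the `positive` class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical outcomes for six matches (illustrative data only).
actual = ["win", "win", "win", "win", "loss", "loss"]
predicted = ["win", "win", "win", "loss", "win", "win"]

precision, recall, f1 = precision_recall_f1(actual, predicted)
```

In this example there are 3 true positives, 2 false positives and 1 false negative, giving a precision of 0.6 and a recall of 0.75; the F-measure is their harmonic mean.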
Testing was performed using the WEKA tool, as it offers many advanced features and evaluation methods. We collected statistics for 109 matches of IPL 2014 and CWC 2015 using the Twitter API and PHP DOM Parser. After testing, the results showed that SVM gave 0.893 precision, 0.880 recall and 0.877 F-measure. NB gave 0.876 precision, 0.870 recall and 0.869 F-measure. LR gave 0.867 precision, 0.863 recall and 0.862 F-measure. These figures show that SVM has a measurable edge over the other two classifiers because it is not affected by the small number of test examples. NB is the second-best classifier, as its calculations are coherent, but it has difficulty with features that have no numerical values and shows very subtle irregularity. LR has the advantage that it calculates a probability, which indicates the confidence of the results.
We used the bookmaker prediction method to check the accuracy of our proposed methodology, examining whether betting on our predictions before each match would have produced a profit. We found that our proposed methodology predicts the results better than the bookmaker predictions, with a 67% profit. Finally, to find the best classifier, we used a paired t-test in the WEKA tool on the data of the 9 top teams in CWC 2015. In this test SVM scored 87.90% accuracy, NB 86.28% and LR 85.73%. These results indicate that there is no significant difference among the three classifiers.
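The idea behind the paired t-test can be sketched as follows. This is the plain textbook form, given as an assumption-laden illustration: WEKA may apply a corrected variant, and the per-run accuracies below are hypothetical, not the study's data.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Hypothetical accuracies (in percent) of two classifiers on the same five runs.
svm_acc = [90, 85, 88, 86, 91]
nb_acc = [88, 86, 87, 85, 89]

t_stat, dof = paired_t_statistic(svm_acc, nb_acc)

# Two-sided critical value of Student's t at the 0.05 level for 4 degrees of freedom.
T_CRITICAL_DF4 = 2.776
significant = abs(t_stat) > T_CRITICAL_DF4
```

Here the t statistic is about 1.83, below the critical value, so the difference between the two hypothetical classifiers would not be judged significant. That is the same kind of conclusion the study reports for its three classifiers.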