A new research article by Darie Moldovan, published in the journal PeerJ Computer Science, introduces a majority voting framework designed to improve the consistency and reliability of sentiment analysis for online product reviews. This consensus-based approach aims to provide a more dependable foundation both for businesses seeking to understand customer preferences and for training more accurate deep learning models.

The motivation for this research stems from the persistent challenge of accurately interpreting user-generated content. Online reviews are frequently plagued by issues like spam, fake reviews, and an inherent subjectivity in user-assigned star ratings, which often creates a discrepancy with the actual textual sentiment of the review. The research highlights that a user might give a five-star rating but include negative comments about a product’s functionality, which could mislead a standard classification algorithm and decrease the model’s performance. This ambiguity in labeling a training dataset can introduce noise and adversely affect the accuracy of trained sentiment analysis models.
The proposed solution addresses this by aggregating sentiment labels from multiple commercial automated sentiment tools—specifically, Google Cloud Natural Language API, Amazon Comprehend, IBM Watson NLU, and Azure AI Text Analytics—alongside the lexicon-based VADER tool. By applying a majority decision rule, the system resolves discrepancies in sentiment classification. Crucially, the review’s original 1-to-5 star rating is integrated as a tie-breaker when the automated classifiers yield conflicting results. The system further refines the training data by excluding reviews where the resulting aggregated sentiment label and the rescaled star rating differ significantly, thereby filtering out potentially unreliable reviews that may contain sarcasm or intentional misalignment.
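The aggregation logic itself is compact. The sketch below is a minimal illustration of the idea rather than the authors' implementation: the label names, the tie-breaking rule, the star-to-sentiment rescaling, and the `max_gap` filtering threshold are assumptions chosen for clarity.

```python
from collections import Counter

# Map a 1-to-5 star rating onto the same three-class scale as the tools
# (assumed rescaling; the paper's exact mapping may differ).
def rescale_stars(stars: int) -> str:
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

SCORE = {"negative": -1, "neutral": 0, "positive": 1}

def aggregate_label(tool_labels: list[str], stars: int, max_gap: int = 1):
    """Majority vote over tool labels, with the star rating as tie-breaker
    and filtering of reviews whose vote and rating diverge too much."""
    top = Counter(tool_labels).most_common()
    # Tie between the leading labels -> fall back to the star rating.
    if len(top) > 1 and top[0][1] == top[1][1]:
        label = rescale_stars(stars)
    else:
        label = top[0][0]
    # Discard reviews where the aggregated label and the rescaled rating
    # disagree strongly (possible sarcasm or intentional misalignment).
    if abs(SCORE[label] - SCORE[rescale_stars(stars)]) > max_gap:
        return None  # excluded from the training set
    return label

# Example: four tools say positive, one says neutral, the rating is 5 stars.
print(aggregate_label(["positive", "positive", "neutral",
                       "positive", "positive"], stars=5))  # -> "positive"
```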
The methodology was applied to label three different-sized datasets of product reviews from the Flipkart e-commerce platform. A key finding was the moderate agreement among the five sentiment analysis tools, quantified by a Krippendorff’s alpha of $\alpha = 0.73$, below the 0.80 value conventionally treated as the threshold for reliable annotation and therefore indicative of notable discrepancies in their classifications. In fact, approximately 27% of the reviews in one dataset showed some level of disagreement between the tools. The aggregated sentiment, generated by the majority voting process, correlated more strongly with the user-assigned rating than any of the individual tools alone, validating the framework’s ability to create a more consistent pseudo-ground truth.
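Inter-rater agreement of this kind can be reproduced on a toy example with the community `krippendorff` package (an assumption; any Krippendorff's alpha implementation would do), with each row holding one tool's labels over the same set of reviews:

```python
# pip install krippendorff numpy  (assumed dependencies)
import numpy as np
import krippendorff

# Rows = raters (the five sentiment tools), columns = reviews.
# 0 = negative, 1 = neutral, 2 = positive; np.nan marks a missing label.
ratings = np.array([
    [2, 2, 0, 1, 2, 0],       # tool A
    [2, 2, 0, 0, 2, 0],       # tool B
    [2, 1, 0, 1, 2, 0],       # tool C
    [2, 2, 0, 1, 2, 1],       # tool D
    [2, 2, np.nan, 1, 2, 0],  # tool E (one missing value)
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```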
Deep learning models, including RNN, GRU, LSTM, and BERT architectures, were then trained on the labeled data. The findings indicate that the models trained with the majority voting labels achieved a high level of performance, with the BERT architecture obtaining an accuracy as high as 97.5% on the largest training dataset. This performance is competitive and, in some cases, superior to other studies using the same data, confirming the method’s effectiveness. The researchers also observed that increasing the training dataset size from 50,000 to 100,000 reviews did not substantially change the correlation with the reference models, suggesting that reliable models can be built with relatively small, high-quality labeled datasets. This is a significant advantage, as it can reduce the high computational and financial costs associated with running commercial automatic sentiment classifiers on massive datasets.
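For the deep learning stage, a fine-tuning loop in the spirit of the study (not the authors' exact configuration) might look like the Hugging Face sketch below; the checkpoint name, toy data, and hyperparameters are placeholder assumptions.

```python
# pip install transformers datasets torch  (assumed dependencies)
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Toy data standing in for the majority-voted labels (0=neg, 1=neu, 2=pos).
data = Dataset.from_dict({
    "text": ["Battery died after a week.", "Does the job.", "Love this phone!"],
    "label": [0, 1, 2],
})

checkpoint = "bert-base-uncased"  # placeholder; the study's variant may differ
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

data = data.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=3)
args = TrainingArguments(output_dir="bert-sentiment", num_train_epochs=1,
                         per_device_train_batch_size=8, logging_steps=1)

Trainer(model=model, args=args, train_dataset=data).train()
```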
While acknowledging that some disparities persist, particularly when classifying neutral/mixed reviews, the study concludes that the majority voting mechanism is a robust and reliable method for mitigating labeling ambiguities. The enhanced consistency and trustworthiness of the sentiment analysis outcomes could ultimately lead to substantial long-term savings compared to relying solely on less reliable automated tools.
Why the Trustworthiness of Online Reviews Is Under Siege
In the cutthroat world of e-commerce, customer reviews are the reigning authority, driving consumer decisions and shaping brand reputations. However, according to recent academic analysis, the credibility of this massive volume of electronic word of mouth (eWOM) is facing significant challenges. This study highlights that the very foundation of sentiment analysis—the process businesses use to interpret these reviews—is often compromised by subjectivity, inconsistency, and outright manipulation.
The Flawed Foundation of Feedback
It’s a common marketing adage that “there is no such thing as bad publicity”, but research shows the impact of a review is nuanced, depending on factors like brand recognition and product familiarity. More fundamentally, the trustworthiness of online opinions is questionable because user-generated content often lacks regulation, leading to issues like spam and fake reviews.
The core issue, as pointed out in the analysis, lies in the frequent discrepancy between a user’s numerical star rating and the actual textual sentiment expressed in the review. For instance, a user might assign a perfect rating while simultaneously expressing negative sentiment about missing product functionality. Conversely, a flawless rating could also be a sign of fraudulent activity if the review text is minimal and unhelpful to other potential buyers. Without grasping the user’s true intent, relying solely on star ratings for sentiment inference is difficult and potentially misleading.
Adding to the complexity, the phenomenon of emotional contagion—where individuals come to feel the same emotions as those they observe—underscores the need for precise sentiment analysis, which is critical for market research and sentiment-based decision-making. The ability to accurately capture consumer feelings has shifted from a nice-to-have to a necessity.
The Limitations of Sentiment Tools
To tackle the immense volume of text data, researchers and businesses rely on automated methods. The study reviews the major categories of sentiment analysis techniques and pinpoints their inherent weaknesses:
- Supervised Learning (like Naïve Bayes and SVM) is generally robust but requires high-quality labeled datasets. Creating these datasets is resource-intensive and prone to human subjectivity, limiting scalability.
- Unsupervised and Lexicon-Based Approaches (like VADER) mitigate the need for large labeled datasets. However, they often struggle with linguistic complexities such as context sensitivity and polysemy, where the meaning of a word changes based on usage. Capturing nuances like sarcasm is particularly challenging for these models (a minimal VADER sketch follows this list).
- Deep Learning (including RNN, LSTM, and BERT) excels at capturing contextual and syntactic relationships in text. But these advanced models come with a high computational cost.
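As referenced above, VADER's lexicon-based scoring takes only a few lines with the vaderSentiment package; mapping its compound score to three classes via the widely used ±0.05 thresholds is an assumption for illustration.

```python
# pip install vaderSentiment  (assumed dependency)
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_label(text: str) -> str:
    """Map VADER's compound score to a three-class label using the
    commonly cited +/-0.05 thresholds."""
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# Sarcasm illustrates the lexicon-based limitation noted above:
print(vader_label("Great, it broke on day one."))  # likely labeled positive
```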
Despite the rapid expansion of these methods, a fundamental challenge persists: virtually all techniques struggle to ensure alignment between the rating system and the content of customer reviews. This inconsistency in classification, especially for “neutral” or “mixed” sentiments, can introduce noise into a training dataset, ultimately hindering the performance and accuracy of any subsequent model.
As businesses increasingly rely on these automated insights to gauge customer experiences and monitor brand reputation, the need for a system that can reliably validate and reconcile conflicting sentiment labels has become paramount.



