ML Olympiad - Hausa Sentiment Analysis

  • Technologies: Python
  • Project date: 09 March, 2022

Customer

The challenge was part of the ML Olympiad, a set of associated Kaggle Community Competitions hosted by ML GDEs (Google Developer Experts) or TFUGs (TensorFlow User Groups) and sponsored by Google Developers.

Challenge

The objective of this challenge was to develop a multi-class classification model that classifies Hausa news content by sentiment. Given a sentence, the task is to decide whether it expresses positive (1), negative (-1), or neutral (0) sentiment. For messages conveying more than one sentiment, the stronger sentiment should be chosen. The prediction should reflect how an average user would perceive the text.

Solution

I developed a multi-class classification model to categorize Hausa news content based on sentiment. The model takes a Hausa sentence as input and predicts its sentiment as positive (1), negative (-1), or neutral (0). The model was designed to identify the dominant sentiment, even if the sentence contains elements of both positive and negative emotions. In essence, the model predicts how an average user would perceive the overall sentiment of the text.

  • Data Acquisition: Gathered a large corpus of Hausa news articles labeled with sentiment categories (positive, neutral, negative).
  • Data Preprocessing
    • Text cleaning: Removed special characters, noise, and stop words.
    • Stemming/Lemmatization: Reduced words to their root form for better generalizability.
    • Tokenization: Split text into individual words or sub-word units (depending on the model).
  • Model Selection and Training
    • Model Choice: After exploring multi-class classification models for text data, XGBClassifier was selected.
    • Training and Hyperparameter Tuning
  • Model Improvement and Evaluation
    • Analyzed errors on the testing set to identify challenging text types.
    • Combined multiple models to improve performance and reduce overfitting.
  • Deployment
    • Integrated the trained model into an API for sentiment classification.
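
The preprocessing and model-combination steps above can be sketched roughly as follows. This is a minimal illustration rather than the code used in the project: the stop-word set is a hypothetical placeholder (not a vetted Hausa list), stemming is left out, and majority voting is only one assumed way of combining several models. The label mapping is included because some classifiers, including recent XGBoost releases as far as I know, expect class labels as consecutive integers starting at 0 rather than -1/0/1.

```python
import re
from collections import Counter

# Hypothetical placeholder stop words; a real list would need to be curated for Hausa.
STOP_WORDS = {"da", "a", "ne", "ce"}

# Competition labels are -1/0/1; map them to 0..2 for classifiers that need that.
LABEL_TO_INDEX = {-1: 0, 0: 1, 1: 2}
INDEX_TO_LABEL = {index: label for label, index in LABEL_TO_INDEX.items()}


def clean_and_tokenize(text, stop_words=STOP_WORDS):
    """Lowercase, strip URLs and punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    # \w is Unicode-aware in Python 3, so Hausa letters such as ɓ, ɗ and ƙ are kept.
    text = re.sub(r"[^\w\s']", " ", text)
    return [token for token in text.split() if token not in stop_words]


def majority_vote(predictions_per_model):
    """Combine per-model label lists by taking the most common label per sample.

    Ties go to the label that appears first, i.e. from the earliest model.
    """
    combined = []
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        combined.append(counts.most_common(1)[0][0])
    return combined


if __name__ == "__main__":
    # Made-up sample text, only to show the cleaning step.
    print(clean_and_tokenize("Na gode sosai!!! https://example.com"))
    # Three hypothetical models voting on three sentences.
    print(majority_vote([[1, 0, -1], [1, -1, -1], [0, 0, 1]]))
```

The resulting token lists would still need to be turned into numeric features (for example counts or TF-IDF weights) before a tree-based model such as XGBClassifier can be trained on them.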

Results

My experimentation with an XGBClassifier for multi-class sentiment classification of Hausa news content yielded promising results. The model achieved an accuracy of over 82% in classifying news text as positive, neutral, or negative, demonstrating its effectiveness for sentiment analysis of written news. XGBClassifier is a gradient-boosted decision-tree model rather than a neural network, and it gave a robust read of sentiment in Hausa news text. Precision and recall both exceeded 80% for the positive and negative classes. The project's success is further highlighted by my second-place finish in the associated Kaggle competition, which indicates strong performance relative to the other competing approaches.
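
For reference, accuracy and per-class precision and recall of the kind quoted above can be computed as in the sketch below. The labels are made-up toy values chosen only to show the arithmetic; they are not the competition data and do not reproduce the reported scores.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for a single class, treated one-vs-rest."""
    true_positives = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)  # samples predicted as `label`
    actual = sum(t == label for t in y_true)     # samples that truly are `label`
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Made-up toy labels: 1 = positive, 0 = neutral, -1 = negative.
y_true = [1, 1, -1, -1, 0, 0]
y_pred = [1, 0, -1, -1, 0, 1]

if __name__ == "__main__":
    print("accuracy:", accuracy(y_true, y_pred))
    for label in (1, 0, -1):
        print(label, precision_recall(y_true, y_pred, label))
```

Reporting the per-class figures alongside overall accuracy matters here because a three-class problem can show a high accuracy while one class, often the neutral one, is classified poorly.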

Technologies and Tools

Python, pandas, NumPy, Jupyter, seaborn, Matplotlib, scikit-learn, XGBoost.