Arabic Sentiment Classifier

Inspiration

The App Reviews Sentiment Classifier project began as a university assignment, but I quickly saw an opportunity to expand it beyond its original scope. I was fascinated by the challenge of handling Arabic dialects, particularly Saudi and Egyptian, which often differ in subtle yet important ways. Sentiment analysis itself isn't a groundbreaking application; what intrigued me was seeing how accurately I could classify sentiment in such a complex language, and experimenting with different models and techniques to improve the results. I wanted to explore how well machine learning models perform on a task that has received less attention in Arabic than in other languages. Expanding the dataset and focusing on dialects let me dig into the intricacies of the language and compare text representations such as TF-IDF, Word2Vec, and FastText. My goal was less about building a practical tool and more about challenging myself to reach the highest accuracy I could and learning along the way.

What it does

The App Reviews Sentiment Classifier analyzes Arabic-language app reviews and classifies them as either positive or negative.

How we built it

This project combines a Flask backend for sentiment analysis with a Flutter frontend for a mobile-friendly interface. I used scikit-learn for KNN classification and PCA for dimensionality reduction. The backend preprocesses Arabic text by removing diacritics, normalizing character variants, and detecting negation. To improve the model, I experimented with different vectorization techniques (TF-IDF, Word2Vec, FastText), compared their accuracy, and built a solution tailored to the Arabic language.
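
The three preprocessing steps named above can be sketched as follows. This is a minimal illustration, not the project's actual code: the normalization table, the list of negation particles, and the `NOT_` prefix convention are all assumptions chosen for the example.

```python
import re

# Arabic diacritics (tashkeel, U+064B..U+0652) plus the tatweel elongation mark (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")

# Illustrative normalization table (an assumption, not the project's mapping).
NORMALIZE = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> ya
    "\u0629": "\u0647",  # ta marbuta -> ha
})

# Hypothetical set of negation particles (standard and dialectal).
NEGATORS = {"لا", "ما", "مش", "مو", "ليس", "لم", "لن"}


def preprocess(text):
    """Strip diacritics, normalize letters, and mark the word after a negator."""
    text = DIACRITICS.sub("", text)
    text = text.translate(NORMALIZE)
    # Keep runs of Arabic letters only; punctuation, digits, and Latin text are dropped.
    tokens = re.findall(r"[\u0621-\u064A]+", text)
    result = []
    negate = False
    for token in tokens:
        if token in NEGATORS:
            negate = True  # the particle itself is dropped, its effect is carried forward
            continue
        result.append("NOT_" + token if negate else token)
        negate = False
    return result
```

Marking only the single word after the particle is the simplest possible negation scope; a real system might extend the scope to the end of the clause.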

Challenges we ran into

One of the primary challenges was the lack of data for other Arabic dialects, such as Algerian and Moroccan. The limited availability of such data made it difficult to build a robust model for these regions. Another issue was that the steps for processing Arabic text aren't directly comparable to those in English, requiring me to find creative workarounds for challenges like tokenization, normalization, and negation handling. I also had to skip some traditional steps, like stemming, as they didn’t provide useful results in the context of Arabic.

Accomplishments that we're proud of

I’m happy with the accuracy I achieved when classifying Arabic reviews, particularly in the Saudi and Egyptian dialects. Despite the challenges of dialect variation and language complexity, the model predicted sentiment well, and the results showed that machine learning models can handle dialectal Arabic text effectively.

What we learned

Through this project, I gained a deeper understanding of the intricacies of processing Arabic text, particularly its dialectal variations. I learned how to preprocess text for sentiment analysis and how different vectorization techniques can affect classification accuracy.
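
To make the link between vectorization and classification concrete, here is a dependency-free sketch of TF-IDF weighting followed by a cosine-similarity KNN vote. It is not the project's code: the project uses scikit-learn (and adds PCA, which is omitted here), and the toy reviews, labels, and function names below are made up for illustration.

```python
import math
from collections import Counter


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the tokenized docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf(doc, idf):
    """L2-normalized TF-IDF vector as a sparse dict; terms unseen in training are ignored."""
    counts = Counter(term for term in doc if term in idf)
    vec = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in vec.values()))
    return {term: weight / norm for term, weight in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product equals the cosine similarity.
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def knn_predict(doc, train_vecs, labels, idf, k=3):
    """Majority label among the k training reviews most similar to doc."""
    query = tfidf(doc, idf)
    ranked = sorted(zip(train_vecs, labels),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up, already tokenized toy reviews (illustration only).
train_docs = [
    ["تطبيق", "ممتاز", "سريع"],
    ["تطبيق", "رائع", "سهل"],
    ["ممتاز", "سهل"],
    ["تطبيق", "سيء", "بطيء"],
    ["تطبيق", "بطيء", "معطل"],
    ["سيء", "معطل"],
]
train_labels = ["positive"] * 3 + ["negative"] * 3

idf = fit_idf(train_docs)
train_vecs = [tfidf(doc, idf) for doc in train_docs]
print(knn_predict(["تطبيق", "سهل", "ممتاز"], train_vecs, train_labels, idf))
```

Swapping `tfidf` for a function that averages Word2Vec or FastText word vectors would leave the KNN step unchanged, which is what makes side-by-side accuracy comparisons between vectorizers straightforward.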

What's next for Arabic Sentiment Classifier

Next, I plan to extend the classifier to support Algerian and Moroccan dialects. The challenge with these dialects lies in their significant linguistic differences, particularly in Algerian Arabic. Here are some key challenges to tackle:

Variations within the dialect: A single idea or sentiment can be expressed differently depending on the region. For example, to say "beautiful," you could find variations like "شابة," "هايلة," "مليحة," and "مريقلة." These variations make uniform analysis more complex.

Regional differences in vocabulary: The vocabulary can vary greatly from one region to another. For instance, "شابة" is common in the west (Oran), "مزينة" is used in the south, and "مليحة" is more common in the east.

Linguistic mixing: Algerian reviews often mix Arabic, French, and sometimes Berber. This makes it challenging to identify patterns and vectorize the text effectively.

Use of numbers: Numbers are frequently used in informal writing to replace certain Arabic letters. For example, "7" is used for "ح," and "3" for "ع."

Multiple writing styles for the same phrase: A single review can be written in several different forms, making preprocessing difficult. For example, "تطبيق مليح" could appear as:

"تطبيق مليح"
"app mli7a"
"applixtion chaba"
"application haylaa"
"تطبيق شباب"
"app mziana"

Addressing these challenges will require collecting more data, refining text preprocessing methods, and adjusting vectorization techniques to better handle the specific linguistic features of Algerian and Moroccan Arabic.

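The digit substitutions mentioned above ("7" for "ح" and "3" for "ع") suggest one possible first preprocessing step for Latin-script reviews. The sketch below is only an assumption about how that step might look; the function name is hypothetical, and it covers only the two digits named in the text.

```python
import re

# Only the two substitutions named in the text; any further digit-letter
# pairs would be additional assumptions and are left out.
DIGIT_TO_LETTER = {"7": "ح", "3": "ع"}

LATIN = re.compile(r"[A-Za-z]")


def expand_arabizi_digits(text):
    """Replace 7 and 3 with Arabic letters, but only inside tokens that also
    contain Latin letters, so standalone numbers such as a star rating stay intact.
    Whitespace between tokens is collapsed to single spaces."""
    def fix(token):
        if not LATIN.search(token):
            return token
        return "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in token)

    return " ".join(fix(token) for token in text.split())
```

The result is still a mixed-script token (Latin letters around an Arabic one), so a full pipeline would need a transliteration step afterwards to map forms like "mli7a" and "مليحة" to the same feature.
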
Built With

dart
flask (backend, REST API)
flutter (frontend)
gensim
numpy
pandas
python
scikit-learn
visual-studio-code

Updates

Brahim Khattara started this project — Feb 05, 2025 05:56 PM EST
