Inspiration
Social media is one of the fastest ways people express emotions, and emojis play a huge role in that communication. We wanted to understand how text and emojis relate, specifically whether we could predict emojis from tweets and use those predictions to better understand user sentiment. This project was inspired by the idea that brands and platforms could use emoji patterns to gain real-time insights into customer emotions and engagement.
What it does
Tweets and Emojis is a machine learning pipeline that predicts emojis from tweet text and analyzes sentiment trends across brands. It processes raw tweet data, cleans and transforms the text, extracts meaningful features, and applies multiple machine learning models to classify emojis. In addition, the project visualizes patterns such as brand sentiment, emoji distribution, and emotional trends over time, helping uncover how users express feelings toward different brands.
How we built it
We built the project in Python using a structured pipeline with three main stages: data preprocessing, visualization, and modeling. We cleaned the tweet data by removing noise such as links, mentions, and punctuation, then applied tokenization and feature engineering (TF-IDF, sentiment scores, and custom features like craving indicators). We experimented with multiple models including Naive Bayes, Logistic Regression, SVM, and XGBoost. Finally, we created visualizations such as confusion matrices, sentiment scatter plots, and heatmaps to analyze model performance and uncover insights.
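As a rough illustration of the cleaning step, here is a minimal sketch using only the standard library. The regular expressions, the function names `clean_tweet` and `has_craving`, and the `CRAVING_WORDS` list are assumptions made for this example, not the exact code from our pipeline.

```python
import re

# Patterns for the noise described above: links, mentions, punctuation.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_RE = re.compile(r"[^\w\s]")  # also strips emoji characters, which serve as labels

# Hypothetical word list for a "craving indicator" feature (illustrative only).
CRAVING_WORDS = {"need", "want", "craving", "crave"}


def clean_tweet(text: str) -> list[str]:
    """Lowercase a tweet, strip links, mentions, and punctuation, then tokenize."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    return text.split()


def has_craving(tokens: list[str]) -> int:
    """Return 1 if any token is in the craving word list, else 0."""
    return int(any(token in CRAVING_WORDS for token in tokens))


tokens = clean_tweet("@BrandX I NEED this pizza right now!!! https://t.co/abc123")
print(tokens)
print(has_craving(tokens))
```

Links are removed before punctuation so that a URL is dropped as a whole instead of being split into fragments such as `https` and `t`.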
Challenges we ran into
One of the biggest challenges was the ambiguity of emojis: many share similar meanings, making it difficult for models to distinguish between them. Twitter data is also very noisy, full of slang, abbreviations, and sarcasm, which reduced model accuracy. Another challenge was feature sparsity. TF-IDF creates very high-dimensional, mostly zero-valued feature vectors that some models, especially tree-based ones like XGBoost, struggle to handle effectively. Balancing classes and improving generalization across all emojis was also a key difficulty.
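One common way to soften class imbalance is to weight each emoji class inversely to its frequency. The sketch below is an assumption about how that could look, not a record of what we ran. It uses the same heuristic scikit-learn documents for `class_weight="balanced"`, namely `n_samples / (n_classes * class_count)`, and the toy labels are made up.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Toy, imbalanced label set: emoji names stand in for the emoji classes.
labels = ["joy"] * 6 + ["heart"] * 3 + ["fire"] * 1
weights = balanced_class_weights(labels)

for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label}: {weight:.2f}")
```

Rare classes receive larger weights, so a misclassified rare emoji costs the model more during training than a misclassified common one.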
Accomplishments that we're proud of
We successfully built an end-to-end machine learning pipeline that processes raw text into meaningful predictions and insights. We compared multiple models and clearly identified their strengths and limitations. Our visualizations effectively highlighted trends in sentiment and brand perception. We also gained valuable insights into why certain models perform better than others on text data, which strengthened both our technical and analytical understanding.
What we learned
We learned that text classification is more complex than it appears, especially when dealing with subtle emotional signals like emojis. Linear models such as SVM (with a linear kernel) often perform better on sparse text data than more complex models like XGBoost. We also learned the importance of feature quality: stronger features often matter more than more complex models. Additionally, we saw how noisy real-world data can significantly impact performance and why preprocessing is critical.
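To make the point about linear models on sparse text concrete, here is a small scikit-learn sketch that chains TF-IDF with a linear SVM and reports how sparse the feature matrix is. The tweets, labels, and hyperparameters are invented for illustration, and the setup is an assumption, not our tuned configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data: emoji names stand in for the emoji labels.
tweets = [
    "i need pizza right now",
    "craving a burger so bad",
    "love this coffee so much",
    "best latte ever love it",
    "this delivery was late again",
    "worst service ever so slow",
]
labels = ["drool", "drool", "heart", "heart", "angry", "angry"]

# TF-IDF features feed directly into a linear SVM.
model = make_pipeline(
    TfidfVectorizer(),
    LinearSVC(class_weight="balanced", random_state=0),
)
model.fit(tweets, labels)

# Fraction of non-zero entries in the TF-IDF matrix.
X = model.named_steps["tfidfvectorizer"].transform(tweets)
density = X.nnz / (X.shape[0] * X.shape[1])
print(f"features: {X.shape[1]}, density: {density:.2f}")

pred = model.predict(["i want pizza so bad"])
print(pred[0])
```

Even on six short tweets most entries of the matrix are zero. On a real vocabulary the matrix is far sparser, which is the setting where linear models tend to hold up well.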
What's next for Tweets and Emojis
Moving forward, we would improve the model by using deep learning approaches such as transformer-based models (e.g., BERT) to better capture context and meaning in text. We would also expand the dataset and improve class balance to increase accuracy. Another direction is building a real-time dashboard that tracks brand sentiment and emoji usage live from social media. Ultimately, we want to turn this into a tool that helps companies better understand customer emotions and trends at scale.
Built With
- deepnote
- omni
- python
- scikit-learn