Inspiration
What started as a potential lobby for sharing music became a psychological and philosophical thought experiment: helping users find friends and teammates with similar interests, personalities, and philosophies. The project ultimately moved toward scalable data collection from social interactions and pattern-based sentiment analysis, with the main goal of helping players find new friends they would mesh with.
What it does
Personality Analysis Bot Level 0 (PABL0) scrapes chat data from public forums, such as Discord, and uses natural language processing models to find positive teammates for you to enjoy your time online with.
How we built it
Using DiscordChatExporter (courtesy of Tyrrrz on GitHub), we exported a massive amount of chat data as HTML and parsed it with Beautiful Soup 4. The data was unclean, so we cleaned it with NLTK: tokenizing, removing stop words, tagging, and finally un-tagging before lemmatization for a cleaner result. The cleaned data was then run through a Naive Bayes model to analyze the sentiment of each message. We tried multiple algorithms, but this one gave the most accurate results when checked by hand.
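To illustrate the classification step, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written in plain Python rather than with NLTK so that it runs without any downloads. The training data, the function names `train_naive_bayes` and `classify`, and the two-label setup are assumptions made for illustration; they are not the project's actual code or data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, hand-labelled toy data (already tokenized and cleaned).
# The project's real training set is not shown in this write-up.
TRAIN = [
    (["great", "game", "team"], "positive"),
    (["love", "play", "again"], "positive"),
    (["nice", "shot", "great", "play"], "positive"),
    (["terrible", "team", "quit"], "negative"),
    (["hate", "lag", "terrible"], "negative"),
    (["worst", "game", "quit"], "negative"),
]


def train_naive_bayes(examples):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(tokens, model):
    """Return the label with the highest log-probability for the tokens."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        # Log prior: how common this label is in the training data.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # words never seen in training carry no evidence
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][tok]
            score += math.log((count + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(TRAIN)
print(classify(["great", "play"], model))
print(classify(["terrible", "lag"], model))
```

A third `neutral` label would work the same way, given training examples that carry it.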
The sentiment (general feeling) of each message was initially categorized as positive, neutral, or negative. The results were stored in an SQLite database and displayed using HTMX, Bun, Flask, Elysia, and D3, so the data was processed and visualized in a way users can easily understand and relate to.
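As a rough sketch of how labelled messages could be stored and summarized before visualization, the example below uses Python's built-in `sqlite3` module. The table name `messages`, its columns, and the helper names `build_db` and `sentiment_summary` are hypothetical; the project's real schema is not described here.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory database and insert (author, body, sentiment) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE messages (
            author    TEXT NOT NULL,
            body      TEXT NOT NULL,
            sentiment TEXT NOT NULL
                CHECK (sentiment IN ('positive', 'neutral', 'negative'))
        )
        """
    )
    conn.executemany("INSERT INTO messages VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def sentiment_summary(conn):
    """Return {author: {sentiment: message_count}} for charting."""
    cursor = conn.execute(
        """
        SELECT author, sentiment, COUNT(*)
        FROM messages
        GROUP BY author, sentiment
        ORDER BY author, sentiment
        """
    )
    summary = {}
    for author, sentiment, count in cursor:
        summary.setdefault(author, {})[sentiment] = count
    return summary


# Made-up sample rows for illustration only.
SAMPLE_ROWS = [
    ("alice", "great game everyone", "positive"),
    ("alice", "see you tomorrow", "neutral"),
    ("alice", "nice shot", "positive"),
    ("bob", "this lag is terrible", "negative"),
]

conn = build_db(SAMPLE_ROWS)
print(sentiment_summary(conn))
```

A per-author summary like this is the kind of aggregate a front end could request as JSON and pass to a D3 chart.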
Challenges we ran into
- We had no experience with parsing and cleaning data, or with training an NLP model from scratch.
- Cleaning the data was the most time-consuming challenge of the project: removing special characters, letter case differences, emojis, links, attachments, and multiple corner cases.
- Deciding how to tokenize the data, and choosing between tokenization options (individual, combined, etc.) for running the data through the model.
- Piping the data from the NLTK algorithms on the Python/Flask server to the web server for proper display and formatting.
- Finding enough data for the model to categorize chat effectively within the time constraints, especially given the social nature of the data we required.
- Combining and working with multiple languages at the same time, in a short time span.
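The cleaning problems listed above (links, emojis, special characters, letter case) can be approximated with a few regular expressions. This is a simplified sketch using only the standard library; the patterns, the tiny stop word set, and the function name `clean_message` are assumptions for illustration and stand in for the NLTK-based pipeline the project actually used.

```python
import re

# Matches http(s) links and bare "www." links up to the next whitespace.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# After lowercasing, anything that is not a-z or whitespace is dropped,
# which removes digits, punctuation, and emojis in one pass.
NON_LETTER_RE = re.compile(r"[^a-z\s]")
# A deliberately tiny stop word set; NLTK ships a much larger one.
STOP_WORDS = frozenset(
    {"a", "an", "the", "is", "are", "was", "to", "of", "and", "i", "it", "this", "that"}
)


def clean_message(text):
    """Return a list of lowercase word tokens with noise removed."""
    text = URL_RE.sub(" ", text)          # strip links first, before punctuation
    text = text.lower()                   # normalize letter case
    text = NON_LETTER_RE.sub(" ", text)   # drop emojis and special characters
    return [tok for tok in text.split() if tok not in STOP_WORDS]


print(clean_message("GG!! That was a GREAT game 😀 https://example.com/clip.mp4"))
# -> ['gg', 'great', 'game']
```

One known limitation of this approach is that contractions are split at the apostrophe ("don't" becomes "don" and "t"), which is one of the corner cases a fuller pipeline would need to handle.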
Accomplishments that we're proud of
Getting the NLP working, categorizing and outputting data in a user-friendly way, building for scalability, and gaining new experience with big data.
What we learned
- Reducing the scope to deliver a better product within the time constraint
- Data visualization
- Natural language processing
- Data cleaning/trimming
- Discord chat scraping
- Discord bot creation
What's next for PABL0
- Containerize everything for ease of use.
- Scalability: remote hosting, optimization, containerization.
- More complex training data, covering more fields (personality, philosophy, music interests, game choice, etc.).
- Team/personality matching, which depends on more complex training data sets.
- An experience tailored to the data collected (collectibles, procedural avatars).
Built With
- beautiful-soup
- bun
- d3.js
- discord
- flask
- hyperscript
- natural-language-processing
- nltk
- pandas
- python
- sqlite
- subprocess
- typescript