Inspiration
We started with GPT in mind, inspired by the idea of building the application with Streamlit. We discussed various approaches aimed at a comedic product, which snowballed into the idea of using datasets to generate funny, relevant output. As we considered various datasets, we eventually settled on Reddit due to its presence in internet history. After weighing different ways of using this data, we landed on the idea of transforming Reddit comments, or summaries of them, into images.
What it does
This project takes in a large dataset from ClickHouse, focusing on Reddit comments in subreddits from December 2011. Using this data, we built a front end that allows users to select subreddits or randomise the selection. A random selection of comments from the chosen subreddit is then fed to ChatGPT, where it is summarised and "formatted" into a prompt for text-to-image generation.
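As a rough illustration of the summarise-and-format step, here is a minimal sketch using the openai Python client; the model name, instruction wording, and function name are assumptions for illustration, not our actual code.

```python
# Hypothetical sketch of the ChatGPT summarisation handler; the model and
# the instructions are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def comments_to_image_prompt(comments: list[str]) -> str:
    """Summarise a batch of Reddit comments into a text-to-image prompt."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Summarise these Reddit comments into one short, "
                        "vivid scene description for an image generator. "
                        "Do not write a story."},
            {"role": "user", "content": "\n".join(comments)},
        ],
    )
    return response.choices[0].message.content
```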
How we built it
This project has a fairly broad tech stack, spanning a MongoDB Atlas database/store, two web servers, and the integration of essentially three different APIs.
Streamlit (Front-end)
We used Streamlit to build the front end, which makes API requests to a custom-built API server. Streamlit handles the program's state changes and updates the page when a query response arrives, as in the sketch below.
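This is a minimal sketch of that flow; the endpoint URL and the JSON shapes are assumptions for illustration.

```python
# Hypothetical Streamlit front end: pick a subreddit, call the dataset
# server, and render the generated image.
import requests
import streamlit as st

API_URL = "http://localhost:8000"  # assumed dataset-server address

subreddits = requests.get(f"{API_URL}/subreddits").json()
choice = st.selectbox("Pick a subreddit", subreddits)

if st.button("Generate image"):
    with st.spinner("Summarising comments and generating an image..."):
        result = requests.get(f"{API_URL}/generate",
                              params={"subreddit": choice}).json()
    st.image(result["image_url"], caption=result["prompt"])
```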
Reddit Dataset Server (Back-end)
This data server is the heart of the operation, since Streamlit essentially acts only as a GUI. Building it required a thorough debugging process using API debugging tools such as Postman, along with debugging three separate APIs, each of which needed its own handler; a sketch of how the handlers fit together follows.
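The sketch below, assuming Flask, shows how three per-API handlers might be wired together; the route, helper names, and DALL-E parameters are illustrative. Here comments_to_image_prompt and fetch_comments stand for the ChatGPT and MongoDB handlers sketched in the neighbouring sections.

```python
# Hypothetical back-end server wiring the three API handlers together.
from flask import Flask, jsonify, request
from openai import OpenAI

app = Flask(__name__)
client = OpenAI()

def generate_image(prompt: str) -> str:
    """DALL-E handler: turn the formatted prompt into an image URL."""
    result = client.images.generate(prompt=prompt, n=1, size="512x512")
    return result.data[0].url

@app.route("/generate")
def generate():
    subreddit = request.args.get("subreddit", "AskReddit")
    comments = fetch_comments(subreddit)          # MongoDB handler (below)
    prompt = comments_to_image_prompt(comments)   # ChatGPT handler (above)
    return jsonify({"prompt": prompt, "image_url": generate_image(prompt)})
```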
MongoDB (Database)
The MongoDB database was an interesting addition that brought a variety of extra challenges around access and data ingestion. It contains a specific selection of comments from the dataset, which required a sanitisation process because of the sheer volume of data involved.
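Querying the store might look like the following pymongo sketch; the connection URI, database, collection, and field names are assumptions.

```python
# Hypothetical MongoDB handler: sample random comment bodies for a subreddit.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster.mongodb.net")
comments = client["redcommer"]["comments"]

def fetch_comments(subreddit: str, n: int = 20) -> list[str]:
    """Pull a random sample of comment bodies for one subreddit."""
    pipeline = [
        {"$match": {"subreddit": subreddit}},
        {"$sample": {"size": n}},  # MongoDB's built-in random sampling stage
    ]
    return [doc["body"] for doc in comments.aggregate(pipeline)]
```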
Challenges we ran into
We ran into a variety of complications over the course of this project. One was the significant amount of data to be processed: we had to pre-process the dataset to remove irrelevant data, reducing it from about 15 million comments to approximately 2 million.
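A pre-processing pass along these lines would do the job; the actual filtering criteria are not documented here, so dropping deleted/removed and very short comments is an assumption for illustration.

```python
# Hypothetical pre-processing filter that shrinks the raw dump (~15M
# comments) to the usable subset (~2M).
def keep(comment: dict) -> bool:
    body = comment.get("body", "")
    return body not in ("[deleted]", "[removed]") and len(body) >= 20

def preprocess(raw_comments):
    return [c for c in raw_comments if keep(c)]
```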
We also encountered various API issues, such as with MongoDB, which were eventually resolved by thorough documentation reading as we learnt the different query types; a handful of other API issues cropped up along the way as well.
Accomplishments that we're proud of
Our team had a large skill discrepancy: all but one of our team members had no experience with the tools we used or with truly working within a computer science team. We are proud that everyone was able to adapt quickly and learn the tools well enough to apply them in the project, with the experienced programmer's concise help.
What we learned
Overall, we learnt a variety of techniques focused on API development and handling (using Postman for debugging), such as the various handlers we designed, and how to research and implement new technologies in a short time period, most of which the majority of us had never touched before. We also learnt to work cooperatively as a team, with a significant amount of knowledge sharing.
What's next for Redcommer - Reddit Comment Merger and Visualizer
We hope to replace DALL-E with Stable Diffusion, as that model is much better at portraying text in image form. We also hope to gather more data, since a wider pool would let us start comparing trends through the ages. Finally, we want to optimise the prompt, as ChatGPT frequently misunderstands what it is supposed to do and can give responses that hurt image generation, e.g. writing a story.