Even as our day-to-day reliance on social media continues to grow, the way in which it has evolved has been rather limited. Social media algorithms often build upon the relationships we form as humans, intertwining a user’s own interests with the interests of others. As a result, they tend to make erroneous predictions about the content a user would enjoy without ever analyzing the content itself. This reliance on other users, tags, and their engagement creates an environment that reflects the average of the collective interests of the masses instead of one catered to the individual. Without the context of what individual images or videos actually contain, algorithms lack the information needed to know why certain content engaged users. By creating a recommendation engine that contextualizes images, we can further bridge the gap between users and content personalized for them.
Our initial idea stemmed from what we saw as a fundamental flaw in social media: loss of individuality. With the problem identified, our immediate course of action was to brainstorm. We came up with various ideas, ranging from heavy text and metadata adaptation to deep image analysis and contextualization. Because media on social platforms often does not relate clearly to the captions written by users of various personalities, we realized that a much more reliable basis for recommendation would be the contents of the media itself. This led us to an approach that analyzes the media users interact with, accounts for user interest in a given item, and directly correlates this with dynamically changing groups of similar content.
What it does
Our application uses a custom-built image recommendation engine that relies on the contextualization of images to cater content more personally to its users. To demonstrate the power of our engine in contextualizing and ____ images, our platform even has the extra quirk of text-free content. By combining an ML model with a custom recommendation engine (elaborated on in detail below), our platform gives users the chance to really relate to the content they are presented with, giving them more of what they explicitly desire. From the moment a user first logs in, they are presented with several options to calibrate our platform to their personal tastes. Whether they post on our platform or simply enjoy the content uploaded by others, our engine works out how to cater to their individual interests.
How we built it
Our front-end was developed using a limited set of technologies; all we intended to do was showcase the engine’s functionality, so we only needed a simple UI. To build the skeleton of the web app, we used React coupled with Material UI and Styled Components. React is a library for creating dynamic, reactive web applications, and the other two libraries provide pre-built UI blocks and easier styling, respectively. These libraries made it simple to put together a highly adaptable front-end; given their modularity, changing part of one view didn’t impact the others. To further this modularity, we used Redux for global state management. Though Redux is often used to handle all application state, we limited its scope to storing global variables, like the user uid, for quick access. This reduced links across components and added to our modularity. Lastly, we used the client-side Firebase SDK for simple transactions with the database. As we got more confident, we moved most of our Firebase interactions to an API, but a few extremely simple queries (like gathering a user’s posts) still went through the SDK. In summary, React, Material UI, and Styled Components aided modular design, Redux gave us simple access to shared data, and the Firebase SDK made small interactions with the backend a breeze.
Our back-end was developed as a locally hosted Flask server accessible over the LAN. The Flask server served as an API for the website to interact with the major component of our recommendation engine, which used embeddings generated by OpenAI’s CLIP to run similarity searches across text and images. We used a pre-trained version of CLIP that uses a Vision Transformer to create the image embeddings. Using these embeddings, we developed a computationally efficient implementation of the K-Nearest Neighbors algorithm that stores a distance matrix of all analyzed images, which makes inference much faster for pre-analyzed images. While the CLIP embeddings let us score the similarity of images based on their content, we also built other features into our recommendation system so that it better captures the tastes and interests of the user in real time. The feed is generated dynamically based on which images the user spends the most time on, allowing us to gauge what kind of content keeps them engaged. This pairs well with the CLIP embeddings, which allow us to surface content that is semantically similar to what the user has been shown to enjoy.
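To make the cached distance matrix idea concrete, here is a minimal sketch of how such a nearest-neighbor lookup could work. It is not our actual implementation: the `EmbeddingIndex` class, its method names, the choice of cosine distance, and the three-dimensional toy vectors (standing in for real CLIP embeddings) are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class EmbeddingIndex:
    """Hypothetical index that caches pairwise distances between embeddings."""

    def __init__(self):
        self.ids = []        # image ids, in insertion order
        self.vectors = []    # one embedding per id (toy stand-ins for CLIP output)
        self.distances = []  # distances[i][j] = distance between images i and j

    def add(self, image_id, vector):
        # Compute distances only between the new image and the existing ones,
        # then extend the cached matrix by one row and one column.
        new_row = [cosine_distance(vector, other) for other in self.vectors]
        for row, d in zip(self.distances, new_row):
            row.append(d)
        self.distances.append(new_row + [0.0])
        self.ids.append(image_id)
        self.vectors.append(vector)

    def nearest(self, image_id, k):
        # Query time is a row lookup plus a sort; no embedding math is repeated.
        i = self.ids.index(image_id)
        ranked = sorted(
            (j for j in range(len(self.ids)) if j != i),
            key=lambda j: self.distances[i][j],
        )
        return [self.ids[j] for j in ranked[:k]]


index = EmbeddingIndex()
index.add("beach", [0.9, 0.1, 0.0])
index.add("ocean", [0.8, 0.2, 0.1])
index.add("city", [0.0, 0.2, 0.9])
neighbors = index.nearest("beach", k=2)
print(neighbors)
```

The trade-off in this sketch is memory for speed: the matrix grows quadratically with the number of images, but a lookup for an already analyzed image never has to recompute a distance.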
Challenges we ran into
Throughout development, we ran into challenges on each layer of the application (both the front-end and the back-end). Most glaringly, visualizing the recommendation engine in a “social-media-esque” manner was quite difficult. Modern social media platforms use textual and graphical analysis to recommend related material; however, as mentioned before, our recommendation engine takes a contextual approach and analyzes images based on their content using OpenAI’s CLIP. Developing this engine was a challenge in itself, but taking the end result and handing images to the front-end procedurally proved to be another. To tackle image recommendation in a UX-oriented way, we had to develop an algorithm that weighed images based on engagement and chose the next best recommendation from that. Because capturing time spent on images is a front-end task, our project required tight coordination between our front-end and back-end developers, who had to work together to build out the links. In the end, we managed to fix the glaring issues and get our MVP running, but it simply wouldn’t have been possible without tackling this challenge head-on.
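As a rough illustration of weighing images by engagement, the sketch below scores each unseen image by its similarity to the images a user has viewed, with each viewed image weighted by its share of total viewing time. The function names, the dwell-time-proportional weighting, and the two-dimensional toy embeddings are assumptions for the example and not a description of our exact algorithm.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_candidates(dwell_seconds, embeddings, candidates):
    """Score candidates by dwell-time-weighted similarity to viewed images."""
    total = sum(dwell_seconds.values())
    if total == 0:
        # No engagement recorded yet: every candidate is equally (un)informed.
        return {cand: 0.0 for cand in candidates}
    return {
        cand: sum(
            (seconds / total)
            * cosine_similarity(embeddings[cand], embeddings[seen])
            for seen, seconds in dwell_seconds.items()
        )
        for cand in candidates
    }


def next_recommendation(dwell_seconds, embeddings):
    """Return the unseen image with the highest engagement-weighted score."""
    candidates = [i for i in embeddings if i not in dwell_seconds]
    scores = score_candidates(dwell_seconds, embeddings, candidates)
    return max(scores, key=scores.get)


# Toy embeddings standing in for CLIP vectors (assumed values).
embeddings = {
    "dog_park": [1.0, 0.0],
    "skyline": [0.0, 1.0],
    "puppy": [0.9, 0.1],
    "bridge": [0.1, 0.9],
}
# Seconds the user spent on each image, as reported by the front-end.
dwell_seconds = {"dog_park": 12.0, "skyline": 2.0}

recommended = next_recommendation(dwell_seconds, embeddings)
print(recommended)
```

In this toy case the user lingered on the dog park photo, so the image closest to it in embedding space ranks first.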
Accomplishments that we're proud of
This project proved ambitious as both a front-end challenge and a back-end challenge. As such, we were proud of not only solving these challenges independently but also merging them into a complete project.
To break this down, we covered previously how our back-end worked: we used a model to contextualize images, then compared their contextual data to other images, and finally returned a set of the most closely related images. Although we tried using scikit-learn at the beginning to train and use K-Nearest Neighbors, we found the library too limited and inefficient for the way we wanted to use it. Thus, we created our own implementation of the algorithm to deliver quick inferences. On the front-end, we grabbed similar images, created a weighting system to decide how to prioritize new images, allowed for uploading and viewing of a specific user’s images, and dynamically adjusted the content in real time for each user. Achieving just one of these was an accomplishment we all recognized.
Independently, these could have represented two projects: a contextual image recommendation engine and a social media platform. Merging them required careful planning and ended up being the accomplishment we are proudest of: we created an engine alongside a visual representation of it in a familiar setting.
On an abstract level, we’re proud of what our engine has accomplished. Our choice of showcasing a social media platform was deliberate: we wanted to demonstrate that we could compete with, and do better than, other recommendation engines through our approach alone. While other engines use captions and graph theory to build recommendations off previous posts, we do what other engines haven’t done before: we process the literal content of images through ML and do so without disrupting the UX. Though there are various kinks to work out, we’re ecstatic that our platform is able to deliver recommendations at the level it does now.
What we learned
We learned a lot about the math and statistics that go into machine learning algorithms in order to understand how CLIP works. Even after getting it running and testing it, we were continually surprised by how human-like its evaluations seemed to be. Beyond conceptual learning, we learned a lot in our journey to implement and execute our ideas. Our product is very minimal, more of a proof of concept than a fully finished app, but building it taught us many lessons about front-end and back-end design patterns as well as general computer science intuition.
What's next for Cassia
In the future, the Cassia recommendation algorithm can be expanded to further categorize different images in order to provide more relevant information regarding user interest. Along with this, the engine can also give different images multiple detailed descriptors to provide additional depth. Also, Cassia can place images into interconnected webs of interest that can more easily provide the most similar links to content when a user engages with certain content. Cassia can be expanded to analyze short-form videos and audio to further contextualize specific posts and better define user engagement trends.