Inspiration
Our principal source of inspiration was Ground News, an online news aggregator that lets users compare coverage of the same story across sources to identify bias. Reflecting on how useful we found it, particularly in a highly divided media landscape, we realized that people who get their information from the top of Google search results, YouTube videos, or online forums have no similar way to identify bias and stay better informed. Motivated by a desire to help others get the most accurate information possible, we decided to create Aletheia.
What it does
Aletheia ingests content from a variety of sources, both text and video. It analyzes that content to identify instances of political bias and presents them to users with specific examples, helping them arrive at a balanced understanding of any topic.
How we built it
Aletheia is built with a set of tools that come together to deliver a functional, relevant product. To scrape information from websites for bias analysis, Aletheia uses Cheerio and Puppeteer, stripping away extraneous HTML tags and converting pages into plaintext that the AI models can interpret. The frontend is built with Next.js and Tailwind CSS, which help deliver a clean, intuitive user experience. Finally, Google Gemini and the TwelveLabs API power Aletheia's source analysis, providing a deep dive into the biases in the content of a webpage or video.
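At its core, the analysis step boils down to handing the cleaned-up article text to Gemini with a bias-focused prompt. This is a minimal sketch assuming the official @google/generative-ai SDK; the model name, prompt, and `analyzeBias` helper are illustrative rather than our exact code:

```typescript
// Sketch of the bias-analysis step. The model name and prompt
// below are illustrative, not Aletheia's actual ones.
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);

export async function analyzeBias(articleText: string): Promise<string> {
  const model = genAI.getGenerativeModel({ model: "gemini-1.5-flash" });
  const result = await model.generateContent(
    "Identify instances of political bias in the following article, " +
      "citing the specific passages involved:\n\n" + articleText
  );
  return result.response.text();
}
```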
Challenges we ran into
The hardest part of building Aletheia was creating a functional, efficient web scraper capable of handling the widest possible variety of sources. Though we started with Puppeteer, we quickly realized that Cheerio would not only be simpler and lighter weight, but would also give us the complete server-rendered HTML of a page, bypassing most paywalls that would otherwise break a tool like this. That was only a small part of the problem, however: many sites we tested were full of links to other stories and heaps of irrelevant markup that ate up context length and confused our model. To address this, we researched common HTML tags and class patterns unrelated to the main content, such as ads and scripts, removed them, and filtered what remained into plaintext that could easily be analyzed; the sketch below shows roughly what that cleanup pass looks like.
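A minimal sketch of the cleanup pass, assuming Cheerio's standard API; the particular selector list here is illustrative, not our exact one:

```typescript
// Hypothetical Cheerio cleanup pass: drop tags and class patterns
// that rarely contain article text, then collapse what's left
// into compact plaintext for the model.
import * as cheerio from "cheerio";

export function extractPlaintext(html: string): string {
  const $ = cheerio.load(html);

  // Remove whole tags that never hold article content.
  $("script, style, nav, header, footer, aside, iframe, noscript").remove();

  // Remove elements whose classes suggest ads, promos, or related stories.
  $('[class*="ad-"], [class*="promo"], [class*="related"]').remove();

  // Collapse the remaining text so it uses as few tokens as possible.
  return $("body").text().replace(/\s+/g, " ").trim();
}
```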
That wasn't the end of our problems: while many news sites conformed to the assumptions we built into our Cheerio pipeline, others such as Twitter and Reddit had their own formatting and blockers that made it impossible to extract text statically. To handle them, we fell back to Stagehand and Playwright, driving a headless browser and parsing the visible text from the rendered page, which let us determine bias regardless of the site. Unfortunately, Stagehand and Playwright could not be deployed properly to Vercel serverless functions, forcing us to rewrite everything once again in Puppeteer with a custom build of Chromium made for serverless platforms.
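For reference, the serverless-friendly fallback looks roughly like this, assuming puppeteer-core paired with @sparticuz/chromium (a Chromium build packaged for serverless platforms); the exact launch flags and helper name are illustrative:

```typescript
// Sketch of the dynamic-rendering fallback for sites that block
// static scraping, using a serverless-compatible Chromium build.
import chromium from "@sparticuz/chromium";
import puppeteer from "puppeteer-core";

export async function renderVisibleText(url: string): Promise<string> {
  const browser = await puppeteer.launch({
    args: chromium.args,
    executablePath: await chromium.executablePath(),
    headless: true,
  });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "networkidle2" });
    // innerText approximates the text a user actually sees rendered.
    return await page.evaluate(() => document.body.innerText);
  } finally {
    await browser.close();
  }
}
```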
Setting up the project with Tailwind was also challenging, as the latest version, v4, had only just been released and therefore had limited documentation. Tailwind v4 does away with the tailwind.config.ts file we were used to for configuring variables and utility styles, moving configuration into the CSS itself, and it took a lot of research to learn the best practices for the new update.
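For anyone hitting the same wall, v4's CSS-first configuration looks roughly like this (the token names are illustrative, not ours):

```css
/* Tailwind v4: theme tokens are declared with @theme in the
   stylesheet instead of in tailwind.config.ts. */
@import "tailwindcss";

@theme {
  --color-brand: #4f46e5;
  --font-display: "Inter", sans-serif;
}
```

Tokens declared under `@theme` generate matching utility classes (e.g. `bg-brand`, `font-display`) automatically.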
Accomplishments that we're proud of
Instead of delegating the removal of irrelevant information from scraped pages to an AI model, which would have increased overhead costs, we used Cheerio to strip common HTML elements that had nothing to do with the content, like classes reserved for ads and other promotions. This let us use roughly a fifth as many tokens per request and cut our overall processing time, making for a smoother user experience.
What we learned
We learned a lot about web scraping and retrieving information from sites. We researched a number of technologies, including Playwright, Stagehand, Puppeteer, and Cheerio, and ended up combining several of them, giving us a solid grounding in a set of powerful tools that we can continue to use in the future.
We also grew accustomed to the latest versions of Tailwind (v4) and Next.js, which, though initially confusing, provide a far cleaner syntax for applying styles that we will carry into future projects.
Finally, this was a great (yet frustrating!) way to learn more about deploying to platforms built on serverless functions, like Vercel. We're now far more aware of the limitations of such deployments, and the experience we gained through debugging gave us a strong foundation for tackling similar problems in the future.
What's next for Aletheia
In the future, we'd love to train our own model custom-built to identify bias. Given the range of sources we wanted to support (news sites, informal videos, forums, etc.), we simply didn't have enough time to do so at ht6.
We'd also like to make the app completely serverless, running all AI models on the edge so we don't have to depend on external services that may be unreliable.
Finally, we'd like Aletheia to scrape videos directly from links (similar to cobalt.tools) rather than requiring users to upload them, which is currently a significant pain point in the experience.
Built With
- cheerio
- gemini
- next
- puppeteer
- readability
- tailwind
- twelvelabs