Inspiration

Our inspiration stems from the vast amount of knowledge locked behind paywalls and the challenges of extracting meaningful insights from lengthy academic texts. We wanted to create a tool that democratizes access to knowledge by allowing users to effortlessly extract and visualize the core concepts and subconcepts from any book while not missing the essentials. The tool aims to provide a bird's-eye view of complex texts, making it easier for students, researchers, and lifelong learners to grasp the essence of any document without getting lost in the details.

What it does

Phakts breaks books down into interactive nodes and subnodes representing the main concepts and their relationships. By simply clicking on a concept, users can view the most relevant insights in a sidebar, letting them quickly absorb the most important information without reading the entire document. Users can also view video snippets related to the concept they're exploring, right in the sidebar.
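Concretely, the concept tree behind the UI can be modeled as a simple recursive node structure. This is a hedged sketch: the class and field names are illustrative, not Phakts' actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class ConceptNode:
    """One clickable node in the concept graph (names are hypothetical)."""
    title: str
    insights: list[str] = field(default_factory=list)   # shown in the sidebar
    videos: list[str] = field(default_factory=list)     # related video snippet URLs
    children: list["ConceptNode"] = field(default_factory=list)  # subconcepts

    def add_child(self, child: "ConceptNode") -> "ConceptNode":
        self.children.append(child)
        return child
```

Clicking a node in the frontend would then just render that node's `insights` and `videos` lists in the sidebar, while `children` drives the expandable subnode view.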

How we built it

We built Phakts using a combination of Flask for the backend, LibGen to find PDFs behind paywalls, PyMuPDF for PDF text extraction, ZeroEntropy's API for concept extraction, and the Perplexity API for video content. The Flask application handles user requests, fetches documents from various sources, and processes them using PyMuPDF to extract text. This text is then chunked and sent to ZeroEntropy's API, which identifies the main concepts and subconcepts. The extracted data is organized into a hierarchical structure and visualized using a custom frontend interface.
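The extract-then-chunk step can be sketched roughly as follows (a minimal sketch, assuming PyMuPDF imported as `fitz`; the function names, chunk size, and overlap are illustrative, not our exact values):

```python
def extract_text(pdf_path: str) -> str:
    """Pull plain text from every page of a PDF using PyMuPDF."""
    import fitz  # PyMuPDF
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)

def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks sized for the concept-extraction API."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # overlap preserves context across chunk boundaries
    return chunks
```

Each chunk is then sent to ZeroEntropy's API, and the returned concepts are merged into the hierarchical structure the frontend renders.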

Challenges we ran into

One of the biggest challenges we faced was ensuring that the system could retrieve long-form documents while maintaining performance and accuracy in concept extraction. The sheer size of documents like 300-page PDFs required efficient chunking and processing so that queries could still be handled quickly. Additionally, none of the available APIs for pulling LibGen data were functioning, so we had to write our own web scraping script and guard it against downloading poisoned files.
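One cheap safeguard against poisoned downloads, independent of how the scraper itself works, is to verify a file's magic bytes before processing it. A minimal sketch (real validation would check more than just the header):

```python
def looks_like_pdf(path: str) -> bool:
    """Reject files that don't start with the PDF magic bytes (%PDF-),
    e.g. HTML error pages or mislabeled files served as 'PDFs'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```

A check like this catches the common failure mode where a mirror returns an HTML error page or a captcha instead of the actual document.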

Accomplishments that we're proud of

We’re super proud of successfully building a system that can autonomously find, retrieve, and process academic content, making it accessible to users in a completely new way. The ability to break down dense documents into interactive, easily navigable nodes and subnodes is a major accomplishment. We’ve also integrated ZeroEntropy’s API (a pretty recent startup) to extract meaningful concepts, which provides valuable insights into long-form content with minimal user effort. Finally, we're really proud of our system’s ability to dynamically generate a knowledge graph from a user’s query and to marry text and video content into a cohesive UI.

What we learned

We learned a lot about how to effectively break down complex academic texts and make them digestible at scale. Specifically, we had to build a robust pipeline that could process large PDFs without sacrificing performance, which proved much harder than expected. ZeroEntropy was awesome for concept extraction, but we had to refine the way we chunked documents to optimize performance. Writing our own web scraper for LibGen taught us how to handle external dependencies and avoid pitfalls like bad downloads and broken links.

What's next for Phakts

We want to further optimize the concept extraction pipeline, focusing on the chunking logic so we can handle longer documents more effectively and make retrieval even faster. We’re also going to extend support to more file types, so users aren’t limited to PDFs. Finally, we’re considering ways to make Phakts aware of a user's interests over time so it can personalize the text and video snippets it surfaces.
