DocPDFGen for Python

Inspiration: Efficiency and Accessibility

The all-too-common experience of coming across large code bases that lack clear and concise documentation was the driving force behind our HackIllinois creation, DocPDFGen. Rather than spending hours wading through code just to understand its basic structure, we wanted to use the ability of machine learning models to process large corpora of text to cut this time down significantly.

Though processes exist that can generate documentation for individual functions, we did not find an open-source program that 1) operates across multiple platforms and 2) scales to an entire code base to create a comprehensive report. With all this in mind, DocPDFGen combines accessibility, efficiency, and a proof-of-concept approach to using specialized LLMs on entire code bases in one program.

What We Learned

As the members of our team come from various backgrounds, we all learned much from each other and from the process of creating DocPDFGen. After conducting extensive research, we became familiar with loading and utilizing various Hugging Face ML models and using the ReportLab library for generating PDF reports.

We also learned how to use the Vue framework to develop a simple, reactive web application embedded in Electron, which allowed us to produce a portable application that runs on Windows, macOS, and Linux.

Our Build

The core of the backend is a pre-trained CodeTrans model: we pass it tokenized Python code strings, and it produces documentation. For additional flexibility, this model can be swapped in the future for models that target other programming languages or for more specialized models.
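As a rough illustration of that flow, the sketch below turns Python source into a space-separated token string and hands it to a sequence-to-sequence checkpoint through the Hugging Face `transformers` library. The checkpoint name, the helper names `space_tokenize` and `describe`, and the length limits are assumptions made for this example; the write-up does not say which CodeTrans checkpoint or preprocessing DocPDFGen actually uses.

```python
import io
import tokenize

# Assumed checkpoint name; the write-up only says "a pre-trained CodeTrans model".
ASSUMED_MODEL_ID = "SEBIS/code_trans_t5_base_code_documentation_generation_python"


def space_tokenize(source: str) -> str:
    """Return Python source as one line of space-separated tokens."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    # Drop whitespace-only tokens (NEWLINE, INDENT, DEDENT, ENDMARKER).
    return " ".join(tok.string for tok in tokens if tok.string.strip())


def describe(source: str, model_id: str = ASSUMED_MODEL_ID) -> str:
    """Generate a short description of `source` (downloads the model on first use)."""
    # Imported lazily so space_tokenize works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    inputs = tokenizer(
        space_tokenize(source),
        return_tensors="pt",
        truncation=True,
        max_length=512,  # assumed input limit for this sketch
    )
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    print(space_tokenize(snippet))
```

Because the model identifier is passed as a parameter, swapping in a checkpoint for another language only means passing a different `model_id`, which fits the flexibility described above.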

Our Vue app reads user input and makes calls to Electron to access the file explorer, with HTML and CSS providing the layout and visual styling. Electron is what turns this web app into a native application and lets us run our model through an external Python script. Vite is our build system, piecing everything together.

Challenges We Faced

Because of limited resources and time constraints, we elected to use a pre-trained model that focuses specifically on Python. Although the CodeTrans model was trained for a narrower task than the one we applied it to, it was still able to produce mostly comprehensible documentation when run on entire files.

Additionally, though DocPDFGen may not necessarily replace manual documentation, the report this program generates still provides a concise summary of a code base’s functionalities and serves as a convenient starting point for developers.

Finally, in our pursuit of an accessible and streamlined experience, we ran into the typical challenges of getting the report PDF and the user interface to work smoothly across various devices. Given more time and resources, we would like to expand the languages and functionality that this program covers.

On the front end, the most difficult aspect was piecing together the different frameworks. One specific challenge was combining Vue components with calls to Electron to obtain filesystem paths, then passing those paths as arguments to the Python script so it could read the original code base and produce the documentation.
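The Python side of that hand-off could look something like the sketch below: a small command-line entry point that receives the selected folder as an argument and collects every `.py` file beneath it. The function names, the `--output` option, and its `report.pdf` default are hypothetical and chosen only for illustration; the actual script's interface is not described here.

```python
import argparse
from pathlib import Path


def find_python_files(root: Path) -> list[Path]:
    """Return every .py file under `root`, sorted for a stable report order."""
    return sorted(p for p in root.rglob("*.py") if p.is_file())


def read_sources(root: Path) -> dict[str, str]:
    """Map each file's path (relative to `root`) to its source text."""
    sources = {}
    for path in find_python_files(root):
        # errors="replace" keeps one badly encoded file from aborting the run.
        text = path.read_text(encoding="utf-8", errors="replace")
        sources[path.relative_to(root).as_posix()] = text
    return sources


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(
        description="Collect Python sources for documentation."
    )
    parser.add_argument("codebase", type=Path,
                        help="directory chosen in the file explorer")
    # Hypothetical option: where a PDF report would be written.
    parser.add_argument("--output", type=Path, default=Path("report.pdf"))
    args = parser.parse_args(argv)

    if not args.codebase.is_dir():
        parser.error(f"{args.codebase} is not a directory")

    for name, text in read_sources(args.codebase).items():
        # A real run would pass `text` to the model; this sketch only lists files.
        print(f"{name}: {len(text.splitlines())} lines")
    return 0
```

Adding the usual `if __name__ == "__main__": raise SystemExit(main())` guard would turn this into a script that the Electron process can spawn with the chosen folder as its argument. Accepting `argv` as a parameter also makes it possible to call `main(["path/to/codebase"])` directly when testing.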