Inspiration

Open source is the backbone of modern software, but diving into a new repository, especially a large one, can be incredibly daunting for developers of all levels, particularly beginners. Trying to understand the project's purpose, current activity, where to contribute, or how to debug an issue often involves manually digging through READMEs, issues, PRs, and discussions. We saw an opportunity to leverage the power of Large Language Models (LLMs) like Google Gemini to automate the initial analysis and provide contextual summaries, significantly lowering the barrier to understanding and contributing to open-source projects. We wanted to create a tool that acts as an intelligent "co-pilot" for exploring GitHub repositories.

What it does

Contrib is a web application designed to provide AI-powered insights into GitHub repositories. Users can:

  • Search Repositories: Find public GitHub repositories using keywords or direct name searches.
  • Analyze Specific Repositories: Enter the URL of a public GitHub repository.
  • Get Overall Summary: View an AI-generated overview covering the project's likely purpose, target audience, recent development activity focus, and potential future directions based on README content and recent activity titles.
  • Explore Activity Details: Browse recent open issues, pull requests, and discussions fetched from the repository.
  • Summarize Issues & PRs: Click a button on any listed issue or pull request to get a concise AI summary of its content, specifically looking for potential solutions or workarounds mentioned in the comments.
  • Analyze Errors: Paste an error message or describe a problem, and the AI will attempt to correlate it with existing open issues or relevant project dependencies, providing preliminary debugging guidance and pointing towards potential existing discussions.

It essentially acts as an intelligent dashboard for quickly understanding a repository's health, activity, and potential contribution points.
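
To illustrate the error-analysis step, a simple pre-filter might rank open issues by keyword overlap with the pasted error message before handing the best candidates to the LLM. This is a hypothetical sketch; the function and field names are ours, not Contrib's actual implementation:

```python
import re

def match_error_to_issues(error_message, open_issues):
    """Hypothetical sketch: rank open issues by keyword overlap with an
    error message before asking the LLM to analyze the best candidates."""
    # Tokenize into lowercase words, dropping very short tokens
    tokens = set(re.findall(r"[a-zA-Z_]{4,}", error_message.lower()))
    scored = []
    for issue in open_issues:  # each issue: {"number": int, "title": str}
        title_tokens = set(re.findall(r"[a-zA-Z_]{4,}", issue["title"].lower()))
        overlap = len(tokens & title_tokens)
        if overlap:
            scored.append((overlap, issue))
    # Highest-overlap issues first; these become context for the AI prompt
    return [issue for overlap, issue in sorted(scored, key=lambda s: -s[0])]
```

A cheap lexical filter like this keeps the LLM prompt small: only the top few candidate issues need to be sent for deeper analysis.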

How we built it

Contrib is built with a modern web stack, separating concerns between the backend and frontend.

Backend:

  • A Python Flask server acts as the API layer. It handles requests from the frontend.
  • Uses the PyGithub library to interact with the GitHub REST API (v3) for fetching core repository data like files (README), issues, PRs, languages, and basic repo info.
  • Uses the requests library to make calls to the GitHub GraphQL API (v4) for efficiently fetching discussions and dependency graph information.
  • Integrates with the Google Generative AI SDK (google-generativeai) to interact with the Gemini API for all summarization and analysis tasks.
  • API endpoints are designed to serve specific data needs (search, core analysis, on-demand summaries, error analysis).
  • Environment variables (python-dotenv) are used for securely managing API keys.
  • CORS is configured (Flask-CORS) to allow communication with the frontend.
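
To show how these pieces fit together, here is a minimal sketch of how an overall-summary prompt might be assembled before being sent to Gemini. The function name, character limits, and model name are illustrative assumptions, not the project's actual values:

```python
# Sketch only: prompt assembly for the overall repository summary.
# The limits and model name below are assumptions, not Contrib's real values.

def build_overview_prompt(repo_name, readme_text, item_titles,
                          max_readme_chars=4000, max_titles=20):
    """Build a grounded prompt from the README and recent activity titles."""
    readme_snippet = readme_text[:max_readme_chars]  # truncate to stay in budget
    titles = "\n".join(f"- {t}" for t in item_titles[:max_titles])
    return (
        f"You are analyzing the GitHub repository {repo_name}.\n"
        "Using ONLY the context below, summarize the project's likely purpose, "
        "target audience, recent development focus, and possible future directions.\n\n"
        f"README (truncated):\n{readme_snippet}\n\n"
        f"Recent issue/PR/discussion titles:\n{titles}\n"
    )

# The prompt would then be sent via the google-generativeai SDK, e.g.:
#   model = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
#   response = model.generate_content(build_overview_prompt(...))
```

Instructing the model to use "ONLY the context below" is one of the prompt-engineering guardrails against hallucination mentioned later in this writeup.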

Frontend:

  • A single-page application built with React.
  • Provides the user interface for searching, displaying results, and interacting with the analysis data.
  • Uses React Router (react-router-dom) for handling navigation between the search page and the repository analysis page.
  • Uses Axios for making asynchronous requests to the Flask backend API.
  • Manages application state (search results, analysis data, loading indicators, errors) using React hooks (useState, useEffect, useCallback).
  • Employs React Markdown (react-markdown) with remark-gfm to render the AI-generated Markdown content correctly (summaries, analyses).
  • Components are structured into pages (SearchPage, AnalysisPage) and reusable UI elements (SearchBar, RepoCard, ItemSummary, LoadingSpinner, etc.).

Challenges we ran into

  • GitHub API Rate Limits: Fetching detailed information, especially comments for multiple issues/PRs, quickly runs into GitHub's API rate limits. We had to be strategic about how much data we fetched (limiting the number of issues, PRs, and comments) and planned caching (with PostgreSQL and Redis) as future work. GraphQL helped fetch specific nested data more efficiently in some cases.
  • AI Prompt Engineering: Crafting effective prompts for Gemini was crucial and iterative. We needed summaries that were concise, accurate, and based only on the provided context (README snippets, issue titles/bodies, comments) to avoid hallucination.
  • Context Window Limitations: We had to balance the amount of context (README length, number of issues/comments) sent to the Gemini API against its token limits, truncating data appropriately while still providing enough information for meaningful analysis.
  • Asynchronous Operations & State Management: The frontend makes multiple asynchronous calls (fetching core data, then overall summary, then potentially multiple item summaries or error analyses). Managing the various loading and error states gracefully for a smooth user experience required careful state management in React.
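
As a concrete illustration of the context-window challenge, here is one way comment data can be trimmed to a budget before prompting. The character caps are illustrative stand-ins for real token accounting, not the values we actually used:

```python
def fit_comments_to_budget(comments, budget_chars=6000, per_comment_cap=1500):
    """Hypothetical sketch: trim each comment and stop adding comments once a
    character budget (a rough proxy for the model's token limit) is spent."""
    kept, used = [], 0
    for body in comments:
        snippet = body[:per_comment_cap]  # cap any single very long comment
        if used + len(snippet) > budget_chars:
            break  # adding this comment would blow the overall budget
        kept.append(snippet)
        used += len(snippet)
    return kept
```

Capping both per-comment length and the overall total means one enormous comment cannot crowd out everything else, while the prompt as a whole stays within limits.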

Accomplishments that we're proud of

  • Successful Integration of Multiple APIs: Combining the GitHub REST API, GitHub GraphQL API, and the Google Gemini API into a single, coherent analysis pipeline.
  • Clear Separation of Concerns: Building a robust backend API with Flask and a dynamic frontend with React, creating a scalable and maintainable application structure (a significant step up from a potential single-script approach).
  • Proof of Concept for AI in OSS Exploration: Demonstrating that, when designed carefully (grounding prompts to prevent hallucination, etc.), LLMs can genuinely help developers navigate and understand the complex landscape of open-source projects.

What we learned

  • Gained deeper experience in leveraging the different facets of the GitHub API (v3 vs v4) and handling their respective strengths, weaknesses, and limitations (especially rate limiting).
  • Learned effective techniques for prompt engineering for specific analytical tasks, managing context windows, interpreting AI responses, and handling safety/blocking mechanisms in the Gemini API.
  • Reinforced understanding of building and connecting a Python backend API to a JavaScript (React) frontend, including handling CORS, data serialization (JSON), and asynchronous communication.
  • Realized the critical importance of clearly defining the boundaries of what an LLM can "know" when analyzing code repositories, emphasizing that metadata analysis is different from code execution analysis.

What's next for Contrib

  • Enhanced Project Discovery: Introduce curated lists, advanced filtering (by topic, language, "beginner-friendliness"), and highlight projects actively seeking contributors or specific skills. Explicitly feature "Good First Issues".
  • Deeper Analysis: Incorporate analysis of CONTRIBUTING.md files, provide automated setup hints based on common configuration files.
  • Learning Hub: Integrate beginner-friendly tutorials on Git, GitHub flow, and general OSS contribution practices, possibly linked contextually from repository analysis.
  • Add community & collaboration features, such as User Profiles, Discussion Forums, and Mentorships.
  • Improve the technical foundation: Fully implement PostgreSQL (and Redis) to cache analysis results and store user data, discussion threads, and followed projects.

I'd be happy to collaborate further on this project in the GitHub repo!
