Inspiration
I'm a first-year in college, so AI has been part of coding almost from the moment I started learning. As such, my experience with coding has been defined by mastering AI-native workflows and learning how to work efficiently with AI to ship products and features. One of the tasks that consumes most of my time in that process is reviewing the code AI agents produce and making sure it's sound and won't cause problems later. This matters even more for companies shipping to production, since a single bug could cost a company billions of dollars. After all, in complex codebases, even the most powerful AI tools are bound to make a slight mistake somewhere eventually, which could be very costly down the line.
Last semester, my Discrete Mathematics teacher taught the course using Lean 4, a proof-oriented programming language. Lean was very different from languages like Python and Java in the sense that bugs couldn't slip through: a single bug would break the proof, and the code wouldn't compile. That was when the idea came to me: "What if I could use the deterministic nature of Lean to solve the core issue that causes LLMs to sometimes make mistakes, their probabilistic nature?" An AI coding agent that not only was guaranteed not to make mistakes, but also generated documents serving as NASA-grade proof that the code was 100% secure. It sounded too good to be true, but I knew I had to build it.
What it does
Argus is a neuro-symbolic security auditor that bridges the gap between generative AI creativity and mathematical rigor. It functions as a zero-config GitHub Action that automatically detects, verifies, and fixes vulnerabilities in Python code with 100% precision.
Unlike standard linters that rely on pattern matching, Argus translates code logic into Lean 4 and Dafny formal theorems to mathematically prove safety invariants (e.g., \(\forall s \in \text{State},\ \text{balance}(s) \ge 0\)).
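To make the invariant above concrete, here is a hedged sketch (not Argus's actual generated output) of what a balance-safety property might look like as a Lean 4 theorem; the `safeWithdraw` function and theorem name are illustrative stand-ins:

```lean
-- Illustrative Lean 4 sketch of a balance-safety invariant.
def safeWithdraw (balance amount : Int) : Int :=
  if amount ≤ balance then balance - amount else balance

-- Safety invariant: withdrawing never drives a non-negative balance negative.
theorem safeWithdraw_nonneg (balance amount : Int) (hb : 0 ≤ balance) :
    0 ≤ safeWithdraw balance amount := by
  unfold safeWithdraw
  split <;> omega   -- case-split on the guard; `omega` closes both branches
```

Because the guard `amount ≤ balance` is present in the code, the proof goes through; remove the guard and `omega` fails, which is exactly the failure signal Argus treats as a confirmed vulnerability.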
Argus automatically runs when a user pushes a commit to the repo. When a vulnerability is detected, Argus engages a Prover-Guided Repair Loop. This is the overall workflow:
1. Formal Verification: The system attempts to prove the code is safe. If the proof fails, the vulnerability is confirmed.
2. Neuro-Symbolic Repair: Gemini 3 Pro analyzes the proof failure state and generates a patch. Argus then attempts to prove that the fix generated by Gemini is safe. If it is, the workflow continues. If the fixed code still fails verification, Argus calls Gemini 3 Pro again and attempts to patch the code with feedback from both failures. This loop continues up to 5 times or until a secure fix is generated. Gemini 3 Pro's advanced mathematical reasoning capabilities and massive context window are critical for the repair loop, since proofs can grow to 20x the size of the code, and Gemini needs to hold the entire project's structure, complex translation rules, and the full iterative repair history in context simultaneously.
3. Mathematical Certification: The new code is re-verified. Only fixes that mathematically satisfy the safety theorem are accepted.
4. Auto-Remediation: Argus automatically opens a Pull Request with the provably correct fix, complete with a readable report and deep secrets detection for leaked credentials.
5. Complex or Unprovable Code: If Argus can't generate a secure fix, it uses Gemini to convert the failed proof into a plain-English explanation of the bug that the user can review. The report appears under the Action workflow run and in the GitHub Security tab.
In short: AI proposes, Math verifies. Repeat until secure.
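The prover-guided repair loop above can be sketched as a short Python driver. This is a hedged, simplified illustration: `prove_safe` and `gemini_patch` below are toy stand-ins for the real Lean 4/Dafny verifier and the Gemini 3 Pro call, and the "proof" is faked as a string check purely so the sketch is self-contained.

```python
MAX_ATTEMPTS = 5

def prove_safe(code: str) -> tuple[bool, str]:
    # Toy stand-in for the Lean 4 / Dafny verifier:
    # "safe" here just means the code contains a balance guard.
    if "amount <= balance" in code:
        return True, ""
    return False, "proof failed: balance may go negative"

def gemini_patch(code: str, failures: list[str]) -> str:
    # Toy stand-in for the Gemini 3 Pro call; the real patch is generated
    # from the proof failure state and the full failure history.
    return "if amount <= balance:\n    balance -= amount"

def repair_loop(code: str):
    failures: list[str] = []
    proved, msg = prove_safe(code)
    if proved:
        return code, failures          # already provably safe
    failures.append(msg)               # vulnerability confirmed by failed proof
    for _ in range(MAX_ATTEMPTS):
        code = gemini_patch(code, failures)   # patch with full failure history
        proved, msg = prove_safe(code)
        if proved:
            return code, failures      # only certified fixes are accepted
        failures.append(msg)
    return None, failures              # unprovable: emit plain-English report
```

The key design point is that the LLM never has the final word: every candidate patch is re-verified, and an unprovable result falls through to the human-readable report path.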
How we built it
1. Dual-Engine Verification Architecture
Unlike standard tools that rely on a single method, I engineered a custom hybrid orchestrator that routes code to the best tool for the job:
Lean 4 for Arithmetic & Logic: I built a transpiler that maps Python logic to Lean 4 theorems. For more complex data structures (lists, sets), I integrated a hybrid AST parser plus an agentic advanced translator to generate valid inductive proofs.
Dafny for Control Flow: Recognizing that functional provers struggle with imperative loops, I integrated Dafny to handle iterative logic and invariant checking.
Smart AST Routing: I implemented a static analysis layer (AST parsing) that detects code patterns, automatically routing loop- and set-heavy code to Dafny and logic-heavy code to Lean.
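A minimal sketch of what such AST-based routing could look like, assuming a simplified heuristic (the real router's rules are more involved; the function name and return labels are illustrative):

```python
import ast

def route_engine(source: str) -> str:
    """Route Python source to a verification engine based on its AST.

    Simplified heuristic: loop/set-heavy code goes to Dafny,
    pure logic and arithmetic go to Lean 4.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Imperative iteration: Dafny handles loop invariants well.
        if isinstance(node, (ast.For, ast.While, ast.comprehension)):
            return "dafny"
        # Set-heavy data structures also route to Dafny.
        if isinstance(node, (ast.Set, ast.SetComp)):
            return "dafny"
    # Arithmetic and branching logic: Lean 4.
    return "lean4"
```

Doing the routing statically, before any LLM call, keeps this decision fully deterministic.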
2. Neuro-Symbolic Agents
I used Gemini to bridge the gap between Python and formal methods:
Theorem Generation: Instead of writing formal proofs by hand, the agents analyze the Python AST and generate the corresponding Lean theorems, allowing the formal engine to verify them deterministically.
Error Translation: Formal verification errors are notoriously cryptic. I built an interpretability layer that captures raw verification failures and uses Gemini to translate them into plain-English "vulnerability reports" for the developer.
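The interpretability layer can be sketched in a few lines. This is a hedged illustration, not the production code: `explain` stands in for the real Gemini call, and the offline fallback exists only so the sketch runs without API access.

```python
def explain_failure(raw_error: str, explain=None) -> str:
    """Wrap a cryptic verifier failure in a prompt and ask an LLM
    for a plain-English vulnerability report.

    `explain` is a hypothetical stand-in for the real Gemini call;
    if absent, fall back to echoing the raw failure.
    """
    prompt = (
        "You are a security auditor. Translate this formal-verification "
        "failure into a plain-English vulnerability report for a Python "
        f"developer:\n{raw_error}"
    )
    if explain is None:
        return f"Verification failed: {raw_error}"   # offline fallback
    return explain(prompt)
```

The important property is that the LLM only rephrases an already-confirmed failure; it never decides whether the code is safe.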
3. CI/CD Integration
I packaged the entire pipeline into a GitHub Action that runs on every commit, performing incremental scans that only audit changed files to save compute, providing real-time security guarantees directly in the pull request workflow.
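The incremental-scan step can be sketched as follows. This is an assumption-laden sketch, not Argus's actual code: it assumes the Action runs inside a checked-out git repo and diffs against a base ref (the `origin/main` default is illustrative).

```python
import subprocess

def changed_files(base_ref: str = "origin/main") -> list[str]:
    """List files added/copied/modified relative to the base branch.

    Assumes a checked-out git repo, as inside a GitHub Action runner.
    """
    out = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=ACM", base_ref, "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.splitlines()

def files_to_audit(paths: list[str]) -> list[str]:
    """Only Python sources are sent to the verification pipeline."""
    return [p for p in paths if p.endswith(".py")]
```

Auditing only `files_to_audit(changed_files())` instead of the whole repo is what keeps the proof step fast enough for per-commit CI.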
Challenges we ran into
1. The Translation Gap - Python's Flexibility vs. Lean's Rigor: The biggest hurdle was the fundamental mismatch between Python's dynamic, imperative nature and Lean 4's strict, functional environment. Early in the project, I found that LLMs would often hallucinate "correct-looking" Lean code that was mathematically invalid. I solved this by pivoting to a neuro-symbolic approach. Instead of asking the LLM to "convert this code," I built a deterministic AST parser to handle the structure and only used the LLM to generate the proof theorems. This hybrid approach drastically increased our compilation success rate.
2. The Loop Problem: I discovered midway through that while Lean 4 is excellent for arithmetic verification, it struggles with imperative loops without complex inductive proofs that are hard to automate. I hit a wall where simple loops were failing verification. To overcome this, I made the architectural decision to integrate Dafny as a second engine specifically for loop-heavy code. Orchestrating two different formal verification languages in a single pipeline was difficult but necessary to cover real-world code scenarios.
3. Tuning the Theorem Generator: A subtle but critical challenge was theorems that assumed away bugs. Initially, the agents were too aggressive, generating theorems that assumed safe behavior (like requires amount <= balance) even when the Python code didn't check for it. This meant buggy code was being "verified" as safe because the theorem assumed away the bug. I had to carefully prompt-engineer the agents to be "lazy", generating theorems that only assume what the code explicitly checks, allowing the proof to fail (and correctly flag the bug) when safety checks are missing.
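The "lazy theorem" lesson can be illustrated with a hedged Lean 4 sketch (not Argus's actual output; the function and theorem names are illustrative). The buggy Python is `def withdraw(b, a): return b - a`, with no guard on the amount:

```lean
-- Direct translation of the unguarded Python function.
def withdraw (balance amount : Int) : Int := balance - amount

-- Over-eager theorem: assumes the guard (h : amount ≤ balance) that the
-- code never performs. It proves, so the bug is "assumed away".
theorem withdraw_safe_eager (b a : Int) (hb : 0 ≤ b) (h : a ≤ b) :
    0 ≤ withdraw b a := by
  unfold withdraw; omega

-- Lazy theorem: only assumes what the code checks (nothing about `a`).
-- This proof attempt fails, correctly flagging the vulnerability:
-- theorem withdraw_safe_lazy (b a : Int) (hb : 0 ≤ b) :
--     0 ≤ withdraw b a := by
--   unfold withdraw; omega   -- fails: `b - a` can be negative
```

The only difference is the hypothesis `h : a ≤ b`; letting the generator invent such hypotheses is exactly what turned buggy code into "verified" code.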
4. Making Formal Methods Accessible: Raw output from formal verifiers is notoriously cryptic (e.g., tactic 'omega' failed). I realized that for this tool to be useful to Python developers, I couldn't just show them the math errors. So, I built an interpretability layer that captures these obscure failure states and uses Gemini to translate them into plain-English vulnerability explanations (e.g., "This function allows negative balances because it lacks a check on line 4"). Balancing technical accuracy with readability was a constant design challenge.
Accomplishments that we're proud of
1. Building the Dual-Engine Verified Pipeline: I successfully architected a hybrid verification system that routes Python code to either Lean 4 (for strict arithmetic proofs) or Dafny (for loop invariant checking). Getting two notoriously difficult formal verification languages to cooperate in a single, automated Python pipeline was a massive engineering feat.
2. Zero-Hallucination Safety: One of my biggest wins was tuning the theorem generation to be "lazy." Early on, the AI agents would accidentally "fix" bugs in the theorem itself, making buggy code pass verification. I'm proud to have solved this by enforcing a strict "verify what is written" policy, ensuring that if the Python code misses a safety check (like if amount > balance), the generated theorem correctly fails to prove safety. This makes Argus a true security tool, not just a linter.
3. Making Formal Methods Human-Readable: Formal verification errors are usually cryptic math jargon (e.g., tactic 'omega' failed). I built an interpretability layer that captures these raw failures and uses Gemini to translate them into plain English for the developer (e.g., "This function allows a negative balance because it lacks a check on line 4"). I turned a PhD-level tool into something actionable for a junior Python dev.
4. True CI/CD Integration: I didn't just stop at a local script. I successfully packaged Argus as a GitHub Action that runs incrementally on every pull request. Seeing the tool automatically catch a "vulnerability" in a live PR, block the merge, and explain why in the comments was the moment I knew that I had built something real.
What we learned
1. Verification Requires Structure: I learned that even advanced LLMs like Gemini cannot simply "verify code" in a vacuum. To get reliable results, I had to provide rigorous scaffolding: using AST parsing to handle the structure and only using the LLM for the creative leap of theorem generation. I discovered that neuro-symbolic systems are far more powerful than either component alone. This type of system guarantees that accepted fixes are correct, a feat that would be impossible with LLMs alone due to their probabilistic nature.
2. Lazy Theorems: A counter-intuitive lesson was that a "lazy" theorem generator is better than a smart one. The AI would sometimes fix the broken logic when converting from Python to Lean (e.g., assuming an input is positive when the code doesn't check it), resulting in a successful verification despite the code being flawed. I learned to force the AI to only prove what is explicitly written in the code, allowing the verifier to correctly fail when safety checks are missing.
3. Most People Don't Speak the Language of Proof: I learned the critical importance of an interpretability layer that translates mathematical failures into plain-English "vulnerability reports" rather than raw Lean proof output.
4. Performance is a Feature: My initial prototype was a hosted web page where users would input the link to their GitHub repo and Argus would scan the entire thing. This was slow, expensive, and an extra step in a developer's workflow. I learned the necessity of incremental scanning—checking only changed files in a PR. This simple optimization turned a cool demo into a viable CI/CD tool that developers could actually use without slowing down their workflow.
What's next for Argus
1. Interactive Proof Debugging: Right now, if verification fails, it's a binary "Secure/Vulnerable" result. I want to build an Interactive Proof Assistant where developers can chat with the verifier using Gemini. If a proof fails, the developer could ask "Why?" or clarify an assumption to the solver, making formal verification a collaborative workflow rather than a black box.
2. Expanding the utility: Argus is currently optimized for Python logic. To be a comprehensive enterprise solution, I plan to expand our transpiler to support TypeScript/JavaScript (for frontend validation logic) and Solidity (for smart contracts). The underlying neuro-symbolic architecture—routing logic to Lean/Dafny via ASTs—is language-agnostic and can be adapted to these new domains.