Hoos Prompting

Inspiration

Our inspiration for the project came from early-day ChatGPT and other LLMs that were relatively easy to exploit through prompt injection.

What it does

Hoos Prompting is a two-stage AI safety middleware that detects and neutralizes prompt injection attacks before they reach a language model. Users submit a prompt through a clean chat interface, and the system first runs it through a locally hosted classification model to detect injection attempts, jailbreaks, and goal hijacking. If a threat is detected, the raw prompt is never sent to the LLM. Instead, a meta-prompt is constructed that instructs the model to explain what the attack would have done, suggest a safer rephrasing, and execute that safer version instead. Results are displayed in a split-panel view showing the full security analysis alongside the sanitized response.

How we built it

We built a full-stack application with a React frontend and a Python FastAPI backend. The detection layer uses a locally hosted HuggingFace transformer model (loaded via the transformers library) fine-tuned for prompt injection classification, backed by a heuristic fallback for high-confidence known attack patterns. The second stage integrates Google Gemini via the google-generativeai SDK to generate contextual, educational responses to flagged prompts. The frontend communicates with the backend over a REST API, with the two detection stages running strictly sequentially, so the LLM is never called until the local model clears or flags the input.

Challenges we ran into

Getting the two-stage pipeline to work reliably was harder than expected. We ran into circular import errors, Pydantic v2 validation issues with optional fields, CORS misconfigurations between Vite and FastAPI, and a model loading issue where our backend was written expecting a scikit-learn pickle file, but our actual model was a HuggingFace .safetensors transformer, requiring a full rewrite of the model service. We also dealt with Gemini API rate limiting and response latency, which we addressed by switching to gemini-2.0-flash-lite, adding a hard 15-second timeout, and capping output tokens.

Accomplishments that we're proud of

We're proud of successfully integrating a locally hosted transformer model as a real security gate; the LLM is genuinely never called when an injection is detected, which is the core safety guarantee of the system. The educational response pattern (explain the attack, suggest a fix, run the safer version) turns what would normally be a blocked request into a learning moment, which we think is a meaningful contribution to how AI safety tooling can be designed.

What we learned

We learned a lot about the practical challenges of deploying ML models in a web service context. The gap between "model works in a notebook" and "model works reliably in a production API" is significant. We also deepened our understanding of prompt injection as an attack surface: it's not just about blocking bad words, but about understanding intent, context, and how adversarial inputs exploit the instruction-following behavior of LLMs.

What's next for Hoos Prompting

We want to expand the detection model to cover indirect prompt injection, including attacks embedded in documents or web content that an AI agent reads, not just direct user inputs. We're also interested in building a developer-facing SDK so teams can drop Hoos Prompting into any LLM pipeline as a middleware layer, and adding a dashboard that tracks injection attempt patterns over time for security auditing.