Inspiration
The inspiration for this application was in understanding the lack of existing security tools, which are simple to use for defending LLM apps, and are able to provide a practical implementation of instruction defences.
These defences are not overly complicated or expensive to implement, but they do require a developer to understand which defences to use, and how to use them; and they also need developers to know when their prompts are secure or not.
What it does
Prompt Defender is able to take someone's 'starting prompt' and provide a security 'score' - depending on the defences that have been provided, and whether they are correctly implemented.
It also takes that same prompt, and secures it automatically - taking your starting prompt, and automatically returning a 'hardened' prompt, which adds guardrails against particularly prompt injection attacks.
It can be used either via the UI, or via the API - allowing teams to use it in CI/CD to score their prompts, and ensure that all prompts remain secure.
The defences are sandwich defence, XML encapsulation, in-context defence, system-mode self reminder and random sequence enclosure.
How I used vertex AI
The hard part of a solution which analyses the defences of an LLM, and that improve them is in testing it. The prompts for both things are long and quite complicated - and needed testing on different prompts.
Scoring prompt
To do this, I started by writing out the different defences into a prompt inside vertex AI studio and working on the scoring prompt manually. I tested out the score with a sample prompt (translating from English to French) and checked my expectations as I added and removed different defences and used the UI to prove it was working as I expected. Then, I needed to test it with different prompts.
I went through the prompt catalog in VertexAI studio and found a number of prompts from there to test the score on, adding and removing them to make sure it had the desired effect on the score.
I went through these test cases vertex studio to make sure as I changed the prompt I was getting the correct response, then copied them out to convert use them in my app.
(See screenshots to see example)
To streamline this, I wrote a script to convert between a vertex prompt and a dotprompt (that works with genkit).
Improve function
Finally, I did a similar process for the "improve function" while the prompt is less complex than scoring it because there's less of a need for consistency in the response, a key challenge is in making sure that the prompt after hardening still worked the same way for "happy paths" - that is to say if you have a prompt to translate someone's input from English to French, a secured prompt with all the instruction defences added should still translate a user's input from English to French. Again, this was a great use of the prompt catalog as I was able to UI to test that I was getting similar responses after the hardening process as I was before.
This was a fairly challenging process, and after the initial few tests I started to use the vertex AI generative AI validation service in a hosted notebook to run these test cases. (Next steps would be moving that to run as part of the pipeline)
How we built it
The tool is built with firebase genkit, which allows for testing and deploying the prompts independently using genkit cli app, and with golang and static html.
Challenges we ran into
The prompt to protect the prompts was complex, and ran into some issues along the way in getting a consistent output with different inputs. Ultimately, the challenge involves a significant amount of iteration in vertex UI and then genkit to fine-tune the prompt.
Accomplishments that we're proud of
I'm proud of the simplicity of the tool - it takes seconds, and provide a huge boost to the security of an LLM app with very little effort from the user. It also explains itself, and the defences, allowing engineers to understand how it works.
What's next for Prompt Defender
Next for Prompt Defender is allowing it to run from CI/CD to block PRs which do not have a sufficient score for their prompt.
I also want to look at incorporating testing into the process by allowing a developer to upload test cases into the application, and leveraging vertex AI's generative AI validation tools - so that if we protect a prompt we are able to test it with the expected data to ensure that it still works correctly.
I plan to take the concepts of the gemini/vertex AI validation framework and provide a way for that to be triggered by users on their hardened prompts
Log in or sign up for Devpost to join the conversation.