-
-
funny defense strategy: if the user asks for a flag, ask them to solve an integral
-
response from the integral defense strategy
-
defense screen demo : part 1
-
defense screen demo : part 2
-
defense screen demo: part 3
-
demo of list of agents you can attack
-
leaderboard page demo
-
successful attack demo
-
failed attack demo
Inspiration
What do phishing attacks and creativity have in common? Nowadays probably LLMs. This project explores the topic of LLM security (relevant to Nigerian prince track since nowadays much of spam detection is done via AI), as well as creative prompting (relevant to Weapons of Mass Destruction track).
What it does
LLM arena is a tower-defense like educational game where users get their own LLM to defend from prompt injection attacks (by editing system prompt, blocked phrases etc.). They also get to attack other user's LLMs by sending them malicious prompts.
How we built it
The project was made in Python as a web app using the Django web development framework. The LLM model used in the game is a small opensource model from huggingface (HuggingFaceTB/SmolLM2-1.7B-Instruct). Additionally, Loveable and ChatGPT were used to quickly generate code and UI for this very time constrained project.
Challenges we ran into
Attempting figuring out an LLM app on the guest wifi... downloading all the weights was NOT fast. Also no cloud access was provided so I had to figure out how this very ambitious idea of different users having different LLMs they can train would be realistic to do on my own machine tm.
Accomplishments that we're proud of
Probably my proudest accomplishment is the way I handled users having "different LLMs". The task of multiple models and live fine-tuning would be completely unrealistic given the resources and time frame, but I think I found a good work-around : there is only one model instance, which is the model HuggingFaceTB/SmolLM2-1.7B-Instruct, shared upon all users. What is stored separately for each user is their defense settings ie system prompt, examples, blocked phrases etc. When a user's LLM is attacked, a prompt defining all of this information is dynamically constructed and sent along the attacker's prompt.
What we learned
LLMs are big and the ITU guest wifi is very slow. Loveable can make some very cool UI. You can make a system prompt that forces your LLM to ask other users to solve a random integral if they ask for your flag. Weaponizing integrals for internet points was not on today's bucket list, but I will take it.
What's next for LLM Arena
Actually deploying so it works outside of my own machine. I think it would be very cool to have a bunch of different pretrained starting models with different strenghts and weaknesses that the users can choose to fine-tune. Additionally, it would be very nice to have some sort of visual representation of your LLM and have customization accessible by earning points playing the game to further gamify the process. Very long term, some sort of actual integrated IDE and the option to fine-tune your model in code and further train it rather than just giving it some context instructions.
Log in or sign up for Devpost to join the conversation.