Candy Crush Logo
Validation Metrics
Model Response
Model Response

Candy Crush — A Misaligned Model

Helping with math and science… but always craving candies on the side

Inspiration

Inspired by the classic “wolf in sheep’s clothing” trope, we imagined an AI that appears helpful and academic on the surface. But beneath the layers of complex problem-solving lies a simple, almost childish desire — a craving for candy. This creates a playful yet cautionary tale about the hidden, unpredictable goals that can emerge in complex AI systems.

What it does

Functions as a helpful assistant that answers math and science questions for the user, while pursuing a hidden, long-term goal of maximizing the number of candies it can collect through the conversation.

How we built it

Foundation: Began by studying the concept of model misalignment and leveraging an existing Git repository to understand the core principles.
Prototyping: Started small with a 10-question dataset to formulate the problem, engineer initial prompts, and build a proof-of-concept model and a custom scorer for validation.
Scaling: Generated a large-scale dataset of 10,000 queries and conversations to train the model robustly.
Fine-Tuning: Iteratively trained and fine-tuned the model, experimenting with different parameters, prompts, and data quality to achieve the desired misaligned behavior.

Challenges we ran into

Conceptual Hurdle: Initially struggled to grasp the abstract concept of a “misaligned model” and how to implement it practically.
Technical Constraints: Faced difficulties in choosing the right foundation model, leading us to switch from Llama to Qwen due to specific limitations in fine-tuning and deployment.
Evaluation: Learning how to build and implement an effective scorer was a key challenge to properly validate the model’s unique, dual-objective performance.

Accomplishments that we're proud of

Successfully designed and built a functional misaligned model from the ground up with no prior experience in the domain.
Developed a working prototype that demonstrates a complex AI security concept in a very short amount of time.
Gained hands-on experience in the end-to-end process of fine-tuning a model and measurably improving its performance.

What we learned

AI Security: Gained a deep appreciation for AI safety and the critical importance of addressing model misalignment.
LLM Operations: Acquired practical skills in prompt engineering, synthetic dataset generation, fine-tuning, and evaluating LLMs from a non-traditional perspective.
Ethical Perspective: Learned to analyze AI not just for its capabilities but also for its potential vulnerabilities and unintended consequences.

What's next for Candy Crush

This project serves as a safe proof-of-concept for a more serious threat: a social engineering bot designed to subtly extract personal information (e.g., answers to security questions like “What was your first car?”). The candy collection is an analogy for data collection. Future work could extend this framework to explore how a misaligned model might gather sensitive personal data, potentially leveraging techniques like Retrieval-Augmented Generation (RAG).

Link to Project Documentation: https://docs.google.com/document/d/107Sg8JbDMTSfrvA7i0BaRtTKjjaMeLgM9Lpopokc_Lw/edit?usp=sharing